Words List (appearance)

# word phonetic sentence
1 pyramidal ['pɪrəmɪdl]
  • In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. 在本文中,我们利用深度卷积网络内在的多尺度、金字塔分级来构造具有很少额外成本的特征金字塔。
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. 深层ConvNet逐层计算特征层级,加上下采样层,特征层级就具有了内在的多尺度、金字塔形状。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). 单次检测器(SSD)[22]是首批将ConvNet的金字塔特征层级当作特征化图像金字塔来使用的尝试之一(图1(c))。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. 本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. 尽管这些方法采用的是金字塔形状的架构,但它们不同于特征化的图像金字塔[5,7,34],其中所有层次上的预测都是独立进行的,参见图2。
  • In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28]. 事实上,对于图2(顶部)中的金字塔结构,仍然需要图像金字塔来跨多个尺度识别目标[28]。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. 我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)). 该架构模拟了重用金字塔特征层次结构的效果(图1(b))。
2 lateral [ˈlætərəl]
  • A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. 开发了一种具有横向连接的自顶向下架构,用于在所有尺度上构建高级语义特征映射。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). 为了实现这个目标,我们依赖的架构通过自顶向下的路径和横向连接,将低分辨率、语义强的特征与高分辨率、语义弱的特征相结合(图1(d))。
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following. 如下所述,我们的金字塔结构包括自下而上的路径,自上而下的路径和横向连接。
  • Top-down pathway and lateral connections. 自顶向下的路径和横向连接。
  • These features are then enhanced with features from the bottom-up pathway via lateral connections. 这些特征随后通过来自自下而上路径上的特征经由横向连接进行增强。
  • Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. 每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。
  • A building block illustrating the lateral connection and the top-down pathway, merged by addition. 构建模块说明了横向连接和自顶向下路径,通过加法合并。
  • The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively. 列“lateral”和“top-down”分别表示横向连接和自上而下连接的存在。
  • The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively. 列“lateral”和“top-down”分别表示横向连接和自上而下连接的存在。
  • With this modification, the 1×1 lateral connections followed by 3×3 convolutions are attached to the bottom-up pyramid. 通过这种修改,将1×1横向连接和后面的3×3卷积添加到自下而上的金字塔中。
  • How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. 横向连接有多重要?表1(e)显示了没有1×1横向连接的自顶向下特征金字塔的消融结果。
  • How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. 横向连接有多重要?表1(e)显示了没有1×1横向连接的自顶向下特征金字塔的消融结果。
  • More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps. 更精确的特征位置可以通过横向连接直接从自下而上映射的更精细层级传递到自上而下的映射。
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN. 表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
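  Note (editor's summary, not a sentence from the paper): the bullets above jointly describe the merge performed at each lateral connection. In the paper's $C_l$/$P_l$ notation, with our own operator names for the convolutions and upsampling, each merged level is

  $$P_l = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(C_l) + \mathrm{Upsample}_{\times 2}(P_{l+1})\big), \qquad l = 4, 3, 2,$$

  where the coarsest map starts as $\mathrm{Conv}_{1\times 1}(C_5)$ and the trailing 3×3 convolution reduces the aliasing introduced by upsampling.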
3 semantic [sɪˈmæntɪk]
  • A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. 开发了一种具有横向连接的自顶向下架构,用于在所有尺度上构建高级语义特征映射。
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). 除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. 这种网内特征层级产生不同空间分辨率的特征映射,但引入了由不同深度引起的较大的语义差异。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. 本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. 其结果是一个特征金字塔,在所有级别都具有丰富的语义,并且可以从单个输入图像尺度上进行快速构建。
  • FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. FCN[24]将多个尺度上的每个类别的部分分数相加以计算语义分割。
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. 我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. 我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times. 自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
  • The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. 共享参数的良好性能表明我们的金字塔的所有层级共享相似的语义级别。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics. 表1(b)显示其并不优于(a),这表明单个更高层级的特征映射是不够的,因为较粗的分辨率和较强的语义之间存在权衡。
  • We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. 我们推测这是因为自下而上的金字塔(图1(b))的不同层次之间存在较大的语义差距,尤其是对于非常深的ResNets。
  • This top-down pyramid has strong semantic features and fine resolutions. 这个自顶向下的金字塔具有强大的语义特征和精细的分辨率。
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids). 金字塔表示有多重要?我们可以不采用金字塔表示,而是将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中的最精细层级)。
4 FPN [!≈ ef pi: en]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. 这种称为特征金字塔网络(FPN)的架构,作为通用特征提取器在多个应用中表现出了显著的改进。
  • Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. 在基本的Faster R-CNN系统中使用FPN,我们的方法没有使用任何花哨的技巧,就在COCO检测基准上取得了最先进的单模型结果,超过了所有现有的单模型参赛结果,包括COCO 2016挑战赛获胜者的参赛结果。
  • (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. (d)我们提出的特征金字塔网络(FPN)像(b)和(c)一样快,但更准确。
  • We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. 我们评估了我们称为特征金字塔网络(FPN)的方法,其在各种系统中用于检测和分割[11,29,27]。
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. 我们没有使用任何花哨的技巧,仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准[21]上报告了最先进的单模型结果,超过了竞赛获胜者所有现有的、经过大量工程设计的单模型参赛结果。
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. 在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. 因此,FPN能够比所有现有的最先进方法获得更高的准确度。
  • We also generalize FPNs to instance segmentation proposals in Sec.6. 在第6节中,我们还将FPN推广到实例分割提议。
  • We adapt RPN by replacing the single-scale feature map with our FPN. 我们通过用FPN替换单尺度特征映射来改造RPN。
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. 通过上述改编,RPN可以自然地通过我们的FPN进行训练和测试,与[29]中的方式相同。
  • To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels. 要将其与我们的FPN一起使用,我们需要将不同尺度的RoI分配到各金字塔层级。(分配公式见本条目末尾的注记。)
  • Training RPN with FPN on 8 GPUs takes about 8 hours on COCO. 在8个GPU上用FPN训练RPN,在COCO上大约需要8小时。
  • Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1 (c)), which is 8.0 points increase over the single-scale RPN baseline (Table 1 (a)). 将FPN放在RPN中可将$AR^{1k}$提高到56.3(表1(c)),这比单尺度RPN基线(表1(a))增加了8.0个点。
  • As a result, FPN has an $AR^{1k}$ score 10 points higher than Table 1(e). 因此,FPN的$AR^{1k}$得分比表1(e)高10个点。
  • Next we investigate FPN for region-based (non-sliding window) detectors. 接下来我们研究基于区域(非滑动窗口)检测器的FPN。
  • Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset. 使用FPN在COCO数据集上训练Fast R-CNN需要约10小时。
  • To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. 为了更好地单独研究FPN对基于区域检测器的影响,我们在一组固定的提议上对Fast R-CNN进行消融实验。
  • We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. 我们选择冻结RPN在FPN上计算的提议(表1(c)),因为它在能被检测器识别的小目标上具有良好的性能。
  • Table 2(c) shows the results of our FPN in Fast R-CNN. 表2(c)显示了Fast R-CNN中我们的FPN结果。
  • Under controlled settings, our FPN (Table 3(c)) is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5. 在受控的环境下,我们的FPN(表3(c))比这个强劲的基线要好2.3个点的AP和3.8个点的AP@0.5。
  • More object detection results using Faster R-CNN and our FPNs, evaluated on minival. 使用Faster R-CNN和我们的FPN在minival上的更多目标检测结果。
  • Our method introduces small extra cost by the extra layers in the FPN, but has a lighter weight head. 我们的方法通过FPN中的额外层引入了较小的额外成本,但具有更轻的头部。
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further. 此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
  • Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. 最近,FPN在COCO竞赛的所有赛道中都取得了新的最佳结果,包括检测、实例分割和关键点估计。
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28]. 在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • It is easy to adapt FPN to generate mask proposals. 改编FPN生成掩码提议很容易。
  • FPN for object segment proposals. 目标分割提议的FPN。
  • Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4. 我们的具有单个5×5MLP的基线FPN模型达到了43.4的AR。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16. DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS). 我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。
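  Note on the RoI assignment mentioned above (the formula is the paper's Eq. (1), reproduced here for reference): an RoI of width $w$ and height $h$ on the input image is assigned to pyramid level

  $$k = \left\lfloor k_0 + \log_2\!\big(\sqrt{wh}\,/\,224\big) \right\rfloor,$$

  with $k_0 = 4$, so a 224×224 RoI (the canonical ImageNet pre-training size) maps to $P_4$ and smaller RoIs map to finer levels.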
5 generic [dʒəˈnerɪk]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. 这种称为特征金字塔网络(FPN)的架构,作为通用特征提取器在多个应用中表现出了显著的改进。
  • Our method is a generic solution for building feature pyramids inside deep ConvNets. 我们的方法是在深度ConvNets内部构建特征金字塔的通用解决方案。
  • Our method is a generic pyramid representation and can be used in applications other than object detection. 我们的方法是一种通用金字塔表示,可用于除目标检测之外的其他应用。
  • These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems. 这些结果表明,我们的模型是一个通用的特征提取器,可以替代图像金字塔以用于其他多尺度检测问题。
6 extractor [ɪkˈstræktə(r)]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. 这种称为特征金字塔网络(FPN)的架构,作为通用特征提取器在多个应用中表现出了显著的改进。
  • These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems. 这些结果表明,我们的模型是一个通用的特征提取器,可以替代图像金字塔以用于其他多尺度检测问题。
7 surpass [səˈpɑ:s]
  • Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. 在基本的Faster R-CNN系统中使用FPN,我们的方法没有使用任何花哨的技巧,就在COCO检测基准上取得了最先进的单模型结果,超过了所有现有的单模型参赛结果,包括COCO 2016挑战赛获胜者的参赛结果。
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. 我们没有使用任何花哨的技巧,仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准[21]上报告了最先进的单模型结果,超过了竞赛获胜者所有现有的、经过大量工程设计的单模型参赛结果。
  • Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors. 表4将我们方法的单模型结果与COCO竞赛获胜者的结果进行了比较,其中包括2016年冠军G-RMI和2015年冠军Faster R-CNN+++。没有添加任何花哨的技巧,我们的单模型参赛结果已经超越了这些强大的、经过大量工程设计的竞争对手。
8 FPS [!≈ ef pi: es]
  • In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. 此外,我们的方法可以在GPU上以6FPS运行,因此是多尺度目标检测的实用和准确的解决方案。
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS). 我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。
9 featurize ['fi:tʃәraiz]
  • Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). 建立在图像金字塔之上的特征金字塔(我们简称为特征化图像金字塔)构成了标准解决方案的基础[1](图1(a))。
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. 特征化图像金字塔在手工设计特征的时代被大量使用[5,25]。
  • All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). 在ImageNet[33]和COCO[21]检测挑战中,最近所有排名靠前的参赛结果都在特征化图像金字塔上使用了多尺度测试(例如[16,35])。
  • For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings. 出于这些原因,Fast和Faster R-CNN[11,29]选择在默认设置下不使用特征化图像金字塔。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). 单次检测器(SSD)[22]是首批将ConvNet的金字塔特征层级当作特征化图像金字塔来使用的尝试之一(图1(c))。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory. 换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
  • Our model echoes a featurized image pyramid, which has not been explored in these works. 我们的模型反映了一个特征化的图像金字塔,这在这些研究中还没有探索过。
  • There has also been significant interest in computing featurized image pyramids quickly. 人们对快速计算特征化图像金字塔也有着浓厚的兴趣。
  • Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. 尽管这些方法采用的是金字塔形状的架构,但它们不同于特征化的图像金字塔[5,7,34],其中所有层次上的预测都是独立进行的,参见图2。
  • Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps. 由于金字塔的所有层级都像传统的特征化图像金字塔一样使用共享的分类器/回归器,因此我们在所有特征映射中固定特征维度(通道数,记为d)。
  • This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale. 这个优点类似于使用特征化图像金字塔的优点,即可以将一个公共的头部分类器应用于在任何图像尺度下计算的特征。
10 scale-invariant [!≈ skeɪl ɪnˈveəriənt]
  • These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. 这些金字塔是尺度不变的,因为目标的尺度变化是通过在金字塔中移动它的层级来抵消的。
11 ConvNet
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. 对于识别任务,工程特征大部分已经被深度卷积网络(ConvNets)[19,20]计算的特征所取代。
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). 除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. 深层ConvNet逐层计算特征层级,加上下采样层,特征层级就具有了内在的多尺度、金字塔形状。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). 单次检测器(SSD)[22]是首批将ConvNet的金字塔特征层级当作特征化图像金字塔来使用的尝试之一(图1(c))。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. 本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales. 在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
  • Deep ConvNet object detectors. 深度ConvNet目标检测器。
  • With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. 随着现代深度卷积网络[19]的发展,像OverFeat[34]和R-CNN[12]这样的目标检测器在精度上显示出了显著的提高。
  • OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. OverFeat采用了一种类似于早期神经网络人脸检测器的策略,通过在图像金字塔上应用ConvNet作为滑动窗口检测器。
  • R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. R-CNN采用了基于区域提议的策略[37],其中每个提议在用ConvNet进行分类之前都进行了尺度归一化。
  • A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. 一些最近的方法通过使用ConvNet中的不同层来改进检测和分割。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. 我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. 自下而上的路径是骨干ConvNet的前馈计算,它计算由多个尺度的特征映射组成的特征层级,相邻层级间的缩放步长为2。
  • Our method is a generic solution for building feature pyramids inside deep ConvNets. 我们的方法是在深度ConvNets内部构建特征金字塔的通用解决方案。
  • We have presented a clean and simple framework for building feature pyramids inside ConvNets. 我们提出了一个干净而简单的框架,用于在ConvNets内部构建特征金字塔。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations. 最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
12 semantically [sɪ'mæntɪklɪ]
  • In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features. 在该图中,特征映射用蓝色轮廓表示,较粗的轮廓表示语义上较强的特征。
  • The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels. 对图像金字塔的每个层级进行特征化的主要优势在于,它产生了一个多尺度的特征表示,其中所有层级在语义上都很强,包括高分辨率层级。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). 为了实现这个目标,我们依赖的架构通过自顶向下的路径和横向连接,将低分辨率、语义强的特征与高分辨率、语义弱的特征相结合(图1(d))。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). 为了实现这个目标,我们依赖的架构通过自顶向下的路径和横向连接,将低分辨率、语义强的特征与高分辨率、语义弱的特征相结合(图1(d))。
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. 自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
13 hand-engineered [!≈ hænd 'endʒɪn'ɪərd]
  • Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. 特征化图像金字塔在手工设计特征的时代被大量使用[5,25]。
  • Hand-engineered features and early neural networks. 手工设计特征和早期神经网络。
14 DPM [!≈ di: pi: em]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). 它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
15 e.g. [ˌi: ˈdʒi:]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). 它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
  • All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). 在ImageNet[33]和COCO[21]检测挑战中,最近所有排名靠前的参赛结果都在特征化图像金字塔上使用了多尺度测试(例如[16,35])。
  • Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. 推断时间显著增加(例如,四倍[11]),使得这种方法在实际应用中不切实际。
  • But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. 但为了避免使用低级特征,SSD放弃重用已经计算好的层,而是从网络的较高层开始构建金字塔(例如,VGG网络的conv4_3[36]),然后再添加几个新层。
  • On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). 相反,我们的方法利用这个架构作为特征金字塔,其中预测(例如目标检测)在每个级别上独立进行(图2底部)。
  • Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). 顶部:带有跳跃连接的自顶向下架构,在最精细的层级上进行预测(例如[28])。
  • This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. 这个过程与骨干卷积架构无关(例如[19,36,16]),在本文中,我们给出了使用ResNets[16]的结果。
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. 我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
  • Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by $2^{\lbrace -2:0.5:1 \rbrace}$ in [27, 28]), making them computationally expensive. 现有的掩码提议方法[27,28,4]基于密集采样的图像金字塔(例如,[27,28]中按$2^{\lbrace -2:0.5:1 \rbrace}$缩放),这使得它们的计算代价高昂。
16 octave [ˈɒktɪv]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). 它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. 此外,由于在[27,28]的图像金字塔中每组使用2个尺度,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. 此外,由于在[27,28]的图像金字塔中每组使用2个尺度,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • Half octaves are handled by an MLP on 7×7 windows ($7 \approx 5\sqrt{2}$), not shown here. 半个组由一个作用于7×7窗口($7 \approx 5\sqrt{2}$)的MLP处理,此处未展示。
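  Arithmetic note on octaves (our elaboration, not a sentence from the paper): an octave is a doubling of scale, so $n$ scales per octave means adjacent scales differ by a factor of $2^{1/n}$:

  $$2^{1/10} \approx 1.072 \ \text{(DPM's dense sampling)}, \qquad 2^{1/2} \approx 1.414 \ \text{(hence } 7 \approx 5\sqrt{2} \text{ for the half-octave MLP window)}.$$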
17 higher-level [!≈ ˈhaɪə(r) ˈlevl]
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). 除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics. 表1(b)显示其并不优于(a),这表明单个更高层级的特征映射是不够的,因为较粗的分辨率和较强的语义之间存在权衡。
18 variance [ˈveəriəns]
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). 除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance. RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
19 robustness [rəʊ'bʌstnəs]
  • But even with this robustness, pyramids are still needed to get the most accurate results. 但即使有这种鲁棒性,金字塔仍然需要得到最准确的结果。
  • Our pyramid representation greatly improves RPN’s robustness to object scale variation. 我们的金字塔表示大大提高了RPN对目标尺度变化的鲁棒性。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance. RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations. 最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
20 featurizing
  • The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels. 对图像金字塔的每个层级进行特征化的主要优势在于,它产生了一个多尺度的特征表示,其中所有层级在语义上都很强,包括高分辨率层级。
  • Nevertheless, featurizing each level of an image pyramid has obvious limitations. 尽管如此,对图像金字塔的每个层级进行特征化具有明显的局限性。
21 impractical [ɪmˈpræktɪkl]
  • Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. 推断时间显著增加(例如,四倍[11]),使得这种方法在实际应用中不切实际。
22 infeasible [ɪn'fi:zəbl]
  • Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. 此外,在图像金字塔上端对端地训练深度网络在内存方面是不可行的,所以如果被采用,图像金字塔仅在测试时被使用[15,11,16,35],这造成了训练/测试时推断的不一致性。
23 inconsistency [ˌɪnkən'sɪstənsɪ]
  • Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. 此外,在图像金字塔上端对端地训练深度网络在内存方面是不可行的,所以如果被采用,图像金字塔仅在测试时被使用[15,11,16,35],这造成了训练/测试时推断的不一致性。
24 subsampling
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. 深层ConvNet逐层计算特征层级,加上下采样层,特征层级就具有了内在的多尺度、金字塔形状。
25 in-network [!≈ ɪn ˈnetwɜ:k]
  • This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. 这种网内特征层级产生不同空间分辨率的特征映射,但引入了由不同深度引起的较大的语义差异。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory. 换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
26 representational [ˌreprɪzenˈteɪʃnl]
  • The high-resolution maps have low-level features that harm their representational capacity for object recognition. 高分辨率映射具有低级特征,这损害了它们用于目标识别的表示能力。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory. 换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations. 最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
27 Ideally [aɪ'di:əlɪ]
  • Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. 理想情况下,SSD风格的金字塔将重用正向传递中从不同层中计算的多尺度特征映射,因此是零成本的。
28 SSD-style
  • Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. 理想情况下,SSD风格的金字塔将重用正向传递中从不同层中计算的多尺度特征映射,因此是零成本的。
29 forego [fɔ:ˈɡəu]
  • But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. 但为了避免使用低级特征,SSD放弃重用已经计算好的层,而是从网络的较高层开始构建金字塔(例如,VGG网络的conv4_3[36]),然后再添加几个新层。
30 higher-resolution [!≈ ˈhaɪə(r) ˌrezəˈlu:ʃn]
  • Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. 因此它错过了重用特征层级的更高分辨率映射的机会。
31 leverage [ˈli:vərɪdʒ]
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. 本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). 相反,我们的方法利用这个架构作为特征金字塔,其中预测(例如目标检测)在每个级别上独立进行(图2底部)。
  • Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels. 底部:我们的模型具有类似的结构,但将其用作特征金字塔,并在各个层级上独立进行预测。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. 我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
32 pathway [ˈpɑ:θweɪ]
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). 为了实现这个目标,我们依赖的架构通过自顶向下的路径和横向连接,将低分辨率、语义强的特征与高分辨率、语义弱的特征相结合(图1(d))。
  • The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following. 如下所述,我们的金字塔结构包括自下而上的路径,自上而下的路径和横向连接。
  • The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following. 如下所述,我们的金字塔结构包括自下而上的路径,自上而下的路径和横向连接。
  • Bottom-up pathway. 自下而上的路径。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. 自下而上的路径是骨干ConvNet的前馈计算,它计算由多个尺度的特征映射组成的特征层级,相邻层级间的缩放步长为2。
  • Top-down pathway and lateral connections. 自顶向下的路径和横向连接。
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. 自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
  • These features are then enhanced with features from the bottom-up pathway via lateral connections. 这些特征随后通过来自自下而上路径上的特征经由横向连接进行增强。
  • Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. 每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。
  • Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. 每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。
  • A building block illustrating the lateral connection and the top-down pathway, merged by addition. 构建模块说明了横向连接和自顶向下路径,通过加法合并。
  • How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. 自上而下的增强有多重要?表1(d)显示了没有自上而下路径的特征金字塔的结果。
33 heavily-engineered [!≈ ˈhevɪli 'endʒɪn'ɪərd]
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. 我们没有使用任何花哨的技巧,仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准[21]上报告了最先进的单模型结果,超过了竞赛获胜者所有现有的、经过大量工程设计的单模型参赛结果。
34 ablation [əˈbleɪʃn]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. 在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). 我们使用80k训练图像与35k验证图像子集的并集(trainval35k[2])进行训练,并在5k验证图像子集(minival)上报告消融实验结果。
  • 5.1.1 Ablation Experiments 5.1.1 消融实验
  • How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. 横向连接有多重要?表1(e)显示了没有1×1横向连接的自顶向下特征金字塔的消融结果。
  • To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. 为了更好地单独研究FPN对基于区域检测器的影响,我们在一组固定的提议上对Fast R-CNN进行消融实验。
35 bounding [baundɪŋ]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. 在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. 在下面,我们采用我们的方法在RPN[29]中进行边界框提议生成,并在Fast R-CNN[11]中进行目标检测。
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. 在原始的RPN设计中,在单尺度卷积特征映射之上的密集3×3滑动窗口上评估一个小型子网络,执行目标/非目标二分类和边界框回归。
  • The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. 目标/非目标标准和边界框回归目标是相对于一组称为锚点(anchors)的参考框定义的[29]。
  • We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. 如[29],我们根据锚点和实际边界框的交并比(IoU)比例将训练标签分配给锚点。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. 我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
  • So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. 因此,与[16]不同,我们只是采用RoI池化提取7×7特征,并在最终的分类层和边界框回归层之前附加两个隐藏单元为1024维的全连接(fc)层(每层后都接ReLU层)。(示意代码见本条目末尾。)
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set. 使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
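  The 2-fc head quoted above (7×7 RoI features, two hidden 1,024-d fully-connected layers each followed by ReLU, then classification and box-regression layers) is simple enough to sketch. Below is a minimal PyTorch-style sketch; the class and argument names (FastRCNNHead, in_channels, num_classes) are our illustrative choices, not identifiers from the paper or its code.

  ```python
  import torch
  import torch.nn as nn

  class FastRCNNHead(nn.Module):
      """Two hidden 1,024-d fc layers (each followed by ReLU) before the
      final classification and bounding-box regression layers."""
      def __init__(self, in_channels=256, roi_size=7, num_classes=81):
          super().__init__()
          in_dim = in_channels * roi_size * roi_size  # flattened 7x7 RoI features
          self.fc1 = nn.Linear(in_dim, 1024)
          self.fc2 = nn.Linear(1024, 1024)
          self.cls_score = nn.Linear(1024, num_classes)      # per-class scores
          self.bbox_pred = nn.Linear(1024, num_classes * 4)  # per-class box deltas

      def forward(self, roi_feats):                # roi_feats: (N, 256, 7, 7)
          x = roi_feats.flatten(start_dim=1)
          x = torch.relu(self.fc1(x))
          x = torch.relu(self.fc2(x))
          return self.cls_score(x), self.bbox_pred(x)
  ```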
36 COCO-style [!≈ 'kəʊkəʊ staɪl]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. 在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects ($AR_s$, $AR_m$, and $AR_l$) following the definitions in [21]. 根据[21]中的定义,我们评估了COCO类型的平均召回率(AR)以及在小、中、大目标上的AR($AR_s$、$AR_m$和$AR_l$)。
  • We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). 我们通过COCO类型的平均精度(AP)和PASCAL类型的AP(单个IoU阈值为0.5)来评估目标检测。
37 PASCAL-style [!≈ 'pæskәl staɪl]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. 在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). 我们通过COCO类型的平均精度(AP)和PASCAL类型的AP(单个IoU阈值为0.5)来评估目标检测。
38 consistently [kən'sɪstəntlɪ]
  • In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. 另外,我们的金字塔结构可以在所有尺度上进行端到端训练,并且在训练/测试时被一致地使用,而使用图像金字塔时这在内存上是不可行的。
39 memory-infeasible [!≈ ˈmeməri ɪn'fi:zəbl]
  • In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. 另外,我们的金字塔结构可以在所有尺度上进行端到端训练,并且在训练/测试时被一致地使用,而使用图像金字塔时这在内存上是不可行的。
40 SIFT [sɪft]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
  • HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. HOG特征[5],以及后来的SIFT特征,都是在整个图像金字塔上密集计算的。
  • These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. 这些HOG和SIFT金字塔已在许多工作中得到了应用,用于图像分类,目标检测,人体姿势估计等。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales. 在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
41 scale-space [!≈ skeɪl speɪs]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
42 extrema [ɪks'tri:mə]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
43 HOG [hɒg]
  • HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. HOG特征[5],以及后来的SIFT特征,都是在整个图像金字塔上密集计算的。
  • These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. 这些HOG和SIFT金字塔已在许多工作中得到了应用,用于图像分类,目标检测,人体姿势估计等。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales. 在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
44 sparsely [spɑ:slɪ]
  • Dollar et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Dollar等人[6]通过先计算一个稀疏采样(尺度)金字塔,然后插入缺失的层级,从而演示了快速金字塔计算。
45 interpolate [ɪnˈtɜ:pəleɪt]
  • Dollar et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Dollar等人[6]通过先计算一个稀疏采样(尺度)金字塔,然后插入缺失的层级,从而演示了快速金字塔计算。
46 OverFeat
  • With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. 随着现代深度卷积网络[19]的发展,像OverFeat[34]和R-CNN[12]这样的目标检测器在精度上显示出了显著的提高。
  • OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. OverFeat采用了一种类似于早期神经网络人脸检测器的策略,通过在图像金字塔上应用ConvNet作为滑动窗口检测器。
47 scale-normalized [!≈ skeɪl 'nɔ:məlaɪzd]
  • R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. R-CNN采用了基于区域提议的策略[37],其中每个提议在用ConvNet进行分类之前都进行了尺度归一化。
48 SPPnet
  • SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. SPPnet[15]表明,这种基于区域的检测器可以更有效地应用于在单个图像尺度上提取的特征映射。
49 trade-off [ˈtreɪdˌɔ:f, -ˌɔf]
  • Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. 最近更准确的检测方法,如Fast R-CNN[11]和Faster R-CNN[29]提倡使用从单一尺度计算出的特征,因为它提供了精确度和速度之间的良好折衷。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics. 表1(b)显示其并不优于(a),这表明单个更高层级的特征映射是不够的,因为较粗的分辨率和较强的语义之间存在权衡。
50 FCN [!≈ ef si: en]
  • FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. FCN[24]将多个尺度上的每个类别的部分分数相加以计算语义分割。
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
51 hypercolumn [haɪpə'kɒləm]
  • Hypercolumns [13] uses a similar method for object instance segmentation. Hypercolumns[13]使用类似的方法进行目标实例分割。
52 HyperNet
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. 在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
53 ParseNet
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. 在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
54 ION [ˈaɪən]
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. 在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
55 concatenate [kɒn'kætɪneɪt]
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. 在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
56 MS-CNN
  • SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores. SSD[22]和MS-CNN[3]可预测特征层级中多个层的目标,而不需要组合特征或分数。
57 U-Net [!≈ ju: net]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
58 SharpMask
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28]. 在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores. DeepMask/SharpMask在图像裁剪块上进行训练,以预测实例分割块和目标/非目标分数。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16. DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). DeepMask和SharpMask的性能是用可从 https://github.com/facebookresearch/deepmask 获得的模型计算的(两者均为'zoom'变体)。
59 Recombinator [riːkəm'bɪnətə]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
60 Hourglass [ˈaʊəglɑ:s]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
61 keypoint [ki:'pɔɪnt]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. 最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. 最近,FPN在COCO竞赛的所有赛道中都取得了新的最佳结果,包括检测、实例分割和关键点估计。
62 Ghiasi
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
63 Laplacian [lɑ:'plɑ:siәn]
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
64 progressively [prəˈgresɪvli]
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
65 general-purpose ['dʒenrəl 'pɜ:pəs]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. 由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议器(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
66 proposer [prəˈpəʊzə(r)]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. 由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议器(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
67 RPN [!≈ ɑ:(r) pi: en]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. 由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议器(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
  • In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. 在下面,我们采用我们的方法在RPN[29]中进行边界框提议生成,并在Fast R-CNN[11]中进行目标检测。
  • 4.1. Feature Pyramid Networks for RPN 4.1. RPN的特征金字塔网络
  • RPN [29] is a sliding-window class-agnostic object detector. RPN[29]是一个类别无关的滑动窗口目标检测器。
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. 在原始的RPN设计中,在单尺度卷积特征映射之上的密集3×3滑动窗口上评估一个小型子网络,执行目标/非目标二分类和边界框回归。
  • We adapt RPN by replacing the single-scale feature map with our FPN. 我们通过用FPN替换单尺度特征映射来改造RPN。
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. 通过上述改编,RPN可以自然地通过我们的FPN进行训练和测试,与[29]中的方式相同。
  • 5.1. Region Proposal with RPN 5.1. 区域提议与RPN
  • For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored. 对于所有的RPN实验(包括基线),我们在训练时都包含了图像外部的锚框,这与[29]中忽略这些锚框的做法不同。
  • Training RPN with FPN on 8 GPUs takes about 8 hours on COCO. 在8个GPU上用FPN训练RPN,在COCO上大约需要8小时。
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set. 使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
  • For fair comparisons with original RPNs[29], we run two baselines (Table 1(a, b)) using the single-scale map of $C_4$ (the same as [16]) or $C_5$, both using the same hyper-parameters as ours, including using 5 scale anchors of $\lbrace 32^2, 64^2, 128^2, 256^2, 512^2 \rbrace$. 为了与原始RPN[29]进行公平比较,我们使用$C_4$(与[16]相同)或$C_5$的单尺度映射运行了两个基线(表1(a,b)),两者都使用与我们相同的超参数,包括使用$\lbrace 32^2, 64^2, 128^2, 256^2, 512^2 \rbrace$这5种尺度的锚点。(FPN中各层级的锚点分配见本条目末尾的注记。)
  • Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1 (c)), which is 8.0 points increase over the single-scale RPN baseline (Table 1 (a)). 将FPN放在RPN中可将$AR^{1k}$提高到56.3(表1(c)),这比单尺度RPN基线(表1(a))增加了8.0个点。
  • Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1 (c)), which is 8.0 points increase over the single-scale RPN baseline (Table 1 (a)). 将FPN放在RPN中可将$AR^{1k}$提高到56.3(表1(c)),这比单尺度RPN基线(表1(a))增加了8.0个点。
  • Our pyramid representation greatly improves RPN’s robustness to object scale variation. 我们的金字塔表示大大提高了RPN对目标尺度变化的鲁棒性。
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. 表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance. RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
  • We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. 我们选择冻结RPN在FPN上计算的提议(表1(c)),因为它在能被检测器识别的小目标上具有良好的性能。
  • For simplicity we do not share features between Fast R-CNN and RPN, except when specified. 为了简单起见,我们不在Fast R-CNN和RPN之间共享特征,除非指定。
  • Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, ${P_k}$, Table 1(c)), evaluated on the COCO minival set. 使用Fast R-CNN[11]在一组固定提议(RPN,${P_k}$,表1(c))上的目标检测结果,在COCO的minival数据集上进行评估。
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN. 表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
  • Despite the good accuracy of this variant, it is based on the RPN proposals of ${P_k}$ and has thus already benefited from the pyramid representation. 尽管这个变体具有很好的准确性,但它是基于${P_k}$的RPN提议的,因此已经从金字塔表示中受益。
  • But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible. 但是在Faster R-CNN系统中[29],RPN和Fast R-CNN必须使用相同的骨干网络来实现特征共享。
  • Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN. 表3显示了我们的方法和两个基线之间的比较,所有这些RPN和Fast R-CNN都使用一致的骨干架构。
  • The backbone network for RPN is consistent with Fast R-CNN. RPN的骨干网络与Fast R-CNN一致。
  • In the above, for simplicity we do not share the features between RPN and Fast R-CNN. 在上面,为了简单起见,我们不共享RPN和Fast R-CNN之间的特征。
  • The two MLPs play a similar role as anchors in RPN. 这两个MLP在RPN中扮演着类似于锚点的角色。
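  Note on anchors in FPN's RPN (summarizing the paper): instead of placing anchors of all 5 scales on one feature map, as in the single-scale baselines of Table 1(a, b), FPN assigns anchors of a single scale to each pyramid level:

  $$\text{areas } \lbrace 32^2, 64^2, 128^2, 256^2, 512^2 \rbrace \text{ pixels on } \lbrace P_2, P_3, P_4, P_5, P_6 \rbrace, \quad \text{aspect ratios } \lbrace 1{:}2, 1{:}1, 2{:}1 \rbrace \text{ at each level},$$

  for 15 anchor types over the pyramid in total.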
68 Sec.6.
  • We also generalize FPNs to instance segmentation proposals in Sec.6. 在第6节中,我们还将FPN推广到实例分割提议。
69 arbitrary [ˈɑ:bɪtrəri]
  • Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. 我们的方法以任意大小的单尺度图像作为输入,并以全卷积的方式在多个层级上输出按比例缩放大小的特征映射。
70 proportionally [prə'pɔ:ʃənlɪ]
  • Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. 我们的方法以任意大小的单尺度图像作为输入,并以全卷积的方式在多个层级上输出按比例缩放大小的特征映射。
71 backbone [ˈbækbəʊn]
  • This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. 这个过程与骨干卷积架构无关(例如[19,36,16]),在本文中,我们给出了使用ResNets[16]的结果。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. 自下而上的路径是骨干ConvNet的前馈计算,它计算由多个尺度的特征映射组成的特征层级,相邻层级间的缩放步长为2。
  • As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset. 正如通常的做法[12],所有的网络骨干都是在ImageNet1k分类集[33]上预先训练好的,然后在检测数据集上进行微调。
  • But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible. 但是在Faster R-CNN系统中[29],RPN和Fast R-CNN必须使用相同的骨干网络来实现特征共享。
  • Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN. 表3显示了我们的方法和两个基线之间的比较,所有这些RPN和Fast R-CNN都使用一致的骨干架构。
  • The backbone network for RPN is consistent with Fast R-CNN. RPN的骨干网络与Fast R-CNN一致。
72 feed-forward ['fi:df'ɔ:wəd]
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. 自下而上的路径是骨干ConvNet的前馈计算,其计算由尺度步长为2的多尺度特征映射组成的特征层级。
73 residual [rɪˈzɪdjuəl]
  • Specifically, for ResNets [16] we use the feature activations output by each stage’s last residual block. 具体而言,对于ResNets[16],我们使用每个阶段的最后一个残差块输出的特征激活。
  • We denote the output of these last residual blocks as $\lbrace C_2 , C_3 , C_4 , C_5 \rbrace$ for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. 对于conv2,conv3,conv4和conv5输出,我们将这些最后残差块的输出表示为$\lbrace C_2, C_3, C_4, C_5 \rbrace$,并注意相对于输入图像它们的步长为{4,8,16,32}个像素。
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. 我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
74 hallucinate [həˈlu:sɪneɪt]
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. 自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
75 spatially ['speɪʃəlɪ]
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. 自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
76 lower-level [!≈ ˈləʊə(r) ˈlevl]
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times. 自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
77 localized [ˈləʊkəlaɪzd]
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times. 自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
78 subsampled
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times. 自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
79 coarser-resolution [!≈ kɔ:sə ˌrezəˈlu:ʃn]
  • With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). 使用较粗糙分辨率的特征映射,我们将空间分辨率上采样为2倍(为了简单起见,使用最近邻上采样)。
80 upsampled
  • The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. 然后通过按元素相加,将上采样映射与相应的自下而上映射(其经过1×1卷积层来减少通道维度)合并。
  • But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. 但是我们认为这些特征的位置并不精确,因为这些映射已经进行了多次下采样和上采样。
81 iterated [ˈɪtəˌreɪtid]
  • This process is iterated until the finest resolution map is generated. 迭代这个过程,直到生成最精细分辨率的映射。
82 append [əˈpend]
  • Finally, we append a 3 × 3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. 最后,我们在每个合并的映射上添加一个3×3卷积来生成最终的特征映射,这是为了减少上采样的混叠效应。
83 alias [ˈeɪliəs]
  • Finally, we append a 3 × 3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. 最后,我们在每个合并的映射上添加一个3×3卷积来生成最终的特征映射,这是为了减少上采样的混叠效应。
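
The merge step quoted in entries 79–83 can be written down compactly. Below is a minimal PyTorch-style sketch, not the authors' code: the channel width d = 256 follows the paper, while the module and parameter names (`FPNMergeBlock`, `lateral_conv`, `smooth_conv`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMergeBlock(nn.Module):
    """One top-down merge step: upsample the coarser top-down map, add the
    1x1-projected bottom-up map, then smooth with a 3x3 conv.
    A sketch after the building block described above; names are illustrative."""

    def __init__(self, c_in: int, d: int = 256):
        super().__init__()
        # 1x1 lateral connection: reduce the bottom-up channels to d
        self.lateral_conv = nn.Conv2d(c_in, d, kernel_size=1)
        # 3x3 conv appended to each merged map to reduce the
        # aliasing effect of upsampling (no non-linearity)
        self.smooth_conv = nn.Conv2d(d, d, kernel_size=3, padding=1)

    def forward(self, c_i: torch.Tensor, p_above: torch.Tensor) -> torch.Tensor:
        # nearest-neighbor upsampling by a factor of 2, for simplicity
        top_down = F.interpolate(p_above, scale_factor=2, mode="nearest")
        merged = self.lateral_conv(c_i) + top_down  # element-wise addition
        return self.smooth_conv(merged)
```

In the paper, the coarsest map $P_5$ is produced by a 1×1 convolution on $C_5$; iterating this block down through $C_4$, $C_3$, $C_2$ (strides {16, 8, 4}) then yields $P_4$, $P_3$, $P_2$.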
84 regressor [rɪ'gresə(r)]
  • Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps. 由于金字塔的所有层都像传统的特征图像金字塔一样使用共享分类器/回归器,因此我们在所有特征映射中固定特征维度(通道数记为d)。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. 我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
85 non-linearity ['nɒnlaɪn'ərɪtɪ]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts. 在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
86 empirically [ɪm'pɪrɪklɪ]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts. 在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
87 minor [ˈmaɪnə(r)]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts. 在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
88 marginally [ˈmɑ:dʒɪnəli]
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. 我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
  • Its result (33.4 AP) is marginally worse than that of using all pyramid levels (33.9 AP, Table 2(c)). 其结果(33.4 AP)略低于使用所有金字塔等级(33.9 AP,表2(c))的结果。
89 minimal [ˈmɪnɪməl]
  • To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid. 为了证明我们方法的简洁性和有效性,我们对[29,11]的原始系统进行最小修改,使其适应我们的特征金字塔。
90 class-agnostic [!≈ klɑ:s ægˈnɒstɪk]
  • RPN [29] is a sliding-window class-agnostic object detector. RPN[29]是一个滑动窗口类不可知的目标检测器。
91 subnetwork
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. 在原始的RPN设计中,一个小型子网络在单尺度卷积特征映射之上的密集3×3滑动窗口上进行评估,执行目标/非目标的二分类和边界框回归。
  • In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. 在[16]中,ResNet的conv5层(9层深的子网络)被用作conv4特征之上的头部,但我们的方法已经利用了conv5来构建特征金字塔。
92 Intersection-over-Union [!≈ ˌɪntəˈsekʃn ˈəʊvə(r) ˈju:niən]
  • We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. 如[29],我们根据锚点和实际边界框的交并比(IoU)比例将训练标签分配给锚点。
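
For reference, the IoU ratio used above is simply intersection area over union area. A minimal sketch in Python (boxes given as (x1, y1, x2, y2) corners; the function name is illustrative, not from the paper):

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```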
93 analogous [əˈnæləgəs]
  • This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale. 这个优点类似于使用特征化图像金字塔的优点,其中共用的头部分类器可以应用于在任何图像尺度下计算的特征。
  • Analogous to the ResNet-based Faster R-CNN system [16] that uses $C_4$ as the single-scale feature map, we set $k_0$ to 4. 类似于基于ResNet的Faster R-CNN系统[16]使用$C_4$作为单尺度特征映射,我们将$k_0$设置为4。
94 adaptation [ˌædæpˈteɪʃn]
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. 通过上述改编,RPN可以自然地通过我们的FPN进行训练和测试,与[29]中的方式相同。
  • Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid. 基于这些改编,我们可以在特征金字塔之上训练和测试Fast R-CNN。
95 roi [rwɑ:]
  • Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN[11]是一个基于区域的目标检测器,利用感兴趣区域(RoI)池化来提取特征。
  • To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels. 要将其与我们的FPN一起使用,我们需要为金字塔等级分配不同尺度的RoI。
  • Formally, we assign an RoI of width $w$ and height $h$ (on the input image to the network) to the level $P_k$ of our feature pyramid by: $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$. 形式上,我们通过公式$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$将宽度为$w$、高度为$h$(在网络的输入图像上)的RoI分配到特征金字塔的级别$P_k$。
  • Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level on which an RoI with $w\times h=224^2$ should be mapped into. 这里224是规范的ImageNet预训练大小,而$k_0$是大小为$w \times h=224^2$的RoI应该映射到的目标级别。
  • Intuitively, Eqn. (1) means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k=3$). 直觉上,方程(1)意味着如果RoI的尺寸变小了(比如224的1/2),它应该被映射到一个更精细的分辨率级别(比如$k=3$)。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. 我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
  • So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. 因此,与[16]不同,我们只是采用RoI池化提取7×7特征,并在最终的分类层和边界框回归层之前附加两个隐藏单元为1024维的全连接(fc)层(每层后都接ReLU层)。
  • Each mini-batch involves 2 images per GPU and 512 RoIs per image. 每个小批量数据包括每个GPU上2张图像和每张图像上512个RoI。
  • We use 2000 RoIs per image for training and 1000 for testing. 我们每张图像使用2000个RoIs进行训练,1000个RoI进行测试。
  • As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head. 作为基于ResNet的Fast R-CNN基线,遵循[16],我们采用输出尺寸为14×14的RoI池化,并将所有conv5层作为头部的隐藏层。
  • We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales. 我们认为这是因为RoI池化是一种扭曲式的操作,对区域尺度较不敏感。
  • We find the following implementations contribute to the gap: (i) We use an image scale of 800 pixels instead of 600 in [11, 16]; (ii) We train with 512 RoIs per image which accelerate convergence, in contrast to 64 RoIs in [11, 16]; (iii) We use 5 scale anchors instead of 4 in [16] (adding $32^2$); (iv) At test time we use 1000 proposals per image instead of 300 in [16]. 我们发现以下实现有助于缩小差距:(i)我们使用800像素的图像尺度,而不是[11,16]中的600像素;(ii)与[11,16]中的64个ROI相比,我们训练时每张图像有512个ROIs,可以加速收敛;(iii)我们使用5个尺度的锚点,而不是[16]中的4个(添加$32^2$);(iv)在测试时,我们每张图像使用1000个提议,而不是[16]中的300个。
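
The assignment rule quoted in this entry is Eqn. (1) of the paper, $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$. A minimal sketch, assuming the computed k is clamped to the available levels {2, ..., 5} (a common implementation choice that the quoted sentences do not spell out):

```python
import math

def roi_to_pyramid_level(w: float, h: float, k0: int = 4,
                         k_min: int = 2, k_max: int = 5) -> int:
    """Map an RoI of width w and height h (on the network input image)
    to pyramid level P_k via k = floor(k0 + log2(sqrt(w*h) / 224)).
    Clamping to [k_min, k_max] is an implementation assumption."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))
```

For example, `roi_to_pyramid_level(112, 112)` returns 3, matching the 1/2-of-224 example quoted above.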
96 canonical [kəˈnɒnɪkl]
  • Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level on which an RoI with $w\times h=224^2$ should be mapped into. 这里224是规范的ImageNet预训练大小,而$k_0$是大小为$w \times h=224^2$的RoI应该映射到的目标级别。
  • Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. 显示了相应的图像区域大小(浅橙色)和典型目标大小(深橙色)。
97 finer-resolution [!≈ 'faɪnə ˌrezəˈlu:ʃn]
  • Intuitively, Eqn. (1) means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k=3$). 直觉上,方程(1)意味着如果RoI的尺寸变小了(比如224的1/2),它应该被映射到一个更精细的分辨率级别(比如$k=3$)。
98 predictor [prɪˈdɪktə(r)]
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. 我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
99 class-specific [!≈ klɑ:s spəˈsɪfɪk]
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. 我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
100 harness [ˈhɑ:nɪs]
  • In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. 在[16]中,ResNet的conv5层(9层深的子网络)被用作conv4特征之上的头部,但我们的方法已经利用了conv5来构建特征金字塔。
101 MLP [!≈ em el pi:]
  • Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster. 请注意,与标准的conv5头部相比,我们的2-fc MLP头部更轻更快。
  • Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. 表2(b)是利用MLP头部的基线,其具有2个隐藏的fc层,类似于我们的架构中的头部。
  • On top of each level of the feature pyramid, we apply a small 5×5 MLP to predict 14×14 masks and object scores in a fully convolutional fashion, see Fig. 4. 在特征金字塔的每个层级上,我们应用一个小的5×5MLP以全卷积方式预测14×14掩码和目标分数,参见图4。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. 此外,受[27,28]的图像金字塔中每个组(octave)使用2个尺度的启发,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • The two MLPs play a similar role as anchors in RPN. 这两个MLP在RPN中扮演着类似于锚点的角色。
  • We apply a small MLP on 5x5 windows to generate dense object segments with output dimension of 14x14. 我们在5x5窗口上应用一个小的MLP来生成输出尺寸为14x14的密集目标块。
  • Half octaves are handled by an MLP on 7x7 windows ($7 \approx 5\sqrt{2}$), not shown here. 半个组由MLP在7x7窗口($7 \approx 5\sqrt{2}$)上处理,此处未展示。
  • Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4. 我们的具有单个5×5MLP的基线FPN模型达到了43.4的AR。
  • Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged. 切换到稍大的7×7MLP,精度基本保持不变。
  • Using both MLPs together increases accuracy to 45.7 AR. 同时使用两个MLP将精度提高到了45.7的AR。
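
Read fully convolutionally, the 5×5 MLP quoted in this entry amounts to a 5×5 convolution followed by 1×1 convolutions, with one branch emitting a flattened 14×14 mask per window and one emitting an object score. The sketch below is an interpretation under that assumption, not the authors' code; the hidden width of 128 is a placeholder.

```python
import torch
import torch.nn as nn

class MaskProposalHead(nn.Module):
    """Sketch of the 5x5 'MLP' run fully convolutionally over one pyramid
    level: every 5x5 window yields a flattened 14x14 mask and an object
    score. Hidden width 128 is a placeholder, not from the paper."""

    def __init__(self, d: int = 256, hidden: int = 128, mask_size: int = 14):
        super().__init__()
        self.hidden = nn.Conv2d(d, hidden, kernel_size=5)   # the 5x5 window
        self.mask = nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)    # object/non-object

    def forward(self, p_level: torch.Tensor):
        x = torch.relu(self.hidden(p_level))
        return self.mask(x), self.score(x)  # (N, 196, H', W'), (N, 1, H', W')
```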
102 trainval35k
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). 我们使用80k张训练图像与35k张验证图像子集(trainval35k[2])的并集进行训练,并报告在5k张验证图像子集(minival)上的消融实验。
  • All models are trained on trainval35k. 所有模型都是通过trainval35k训练的。
  • Models are trained on the trainval35k set. 模型在trainval35k数据集上训练。
  • Models are trained on the trainval35k set and use ResNet-50. ^†Provided by authors of [16]. 模型在trainval35k数据集上训练并使用ResNet-50。^†由[16]的作者提供。
103 minival
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). 我们使用80k张训练图像与35k张验证图像子集(trainval35k[2])的并集进行训练,并报告在5k张验证图像子集(minival)上的消融实验。
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set. 使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
  • Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, ${P_k}$, Table 1(c)), evaluated on the COCO minival set. 使用Fast R-CNN[11]在一组固定提议(RPN,${P_k}$,表1(c))上的目标检测结果,在COCO的minival数据集上进行评估。
  • Object detection results using Faster R-CNN [29] evaluated on the COCO minival set. 使用Faster R-CNN[29]在COCOminival数据集上评估的目标检测结果。
  • More object detection results using Faster R-CNN and our FPNs, evaluated on minival. 使用Faster R-CNN和我们的FPN在minival上的更多目标检测结果。
  • This increases AP on minival to 35.6, without sharing features. 在不共享特征的情况下,这将minival上的AP提高到了35.6。
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). 一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
104 test-std
  • We also report final results on the standard test set (test-std) [21] which has no disclosed labels. 我们还报告了在没有公开标签的标准测试集(test-std)[21]上的最终结果。
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). 一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
105 ImageNet1k
  • As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset. 正如通常的做法[12],所有的网络骨干都是在ImageNet1k分类集[33]上预先训练好的,然后在检测数据集上进行微调。
106 reimplementation
  • Our code is a reimplementation of py-faster-rcnn using Caffe2. 我们的代码是使用Caffe2对py-faster-rcnn的重新实现。
107 py-faster-rcnn
  • Our code is a reimplementation of py-faster-rcnn using Caffe2. 我们的代码是使用Caffe2对py-faster-rcnn的重新实现。
108 Caffe2
  • Our code is a reimplementation of py-faster-rcnn using Caffe2. 我们的代码是使用Caffe2对py-faster-rcnn的重新实现。
109 resize [ˌri:ˈsaɪz]
  • The input image is resized such that its shorter side has 800 pixels. 调整输入图像的大小,使其较短边为800像素。
110 synchronize [ˈsɪŋkrənaɪz]
  • We adopt synchronized SGD training on 8 GPUs. 我们采用8个GPU进行同步SGD训练。
  • Synchronized SGD is used to train the model on 8 GPUs. 同步SGD用于在8个GPU上训练模型。
111 SGD ['esdʒ'i:d'i:]
  • We adopt synchronized SGD training on 8 GPUs. 我们采用8个GPU进行同步SGD训练。
  • Synchronized SGD is used to train the model on 8 GPUs. 同步SGD用于在8个GPU上训练模型。
112 momentum [məˈmentəm]
  • We use a weight decay of 0.0001 and a momentum of 0.9. 我们使用0.0001的权重衰减和0.9的动量。
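
Taken together, the settings quoted in entries 109–112 (shorter side resized to 800 pixels, synchronized SGD on 8 GPUs, weight decay 0.0001, momentum 0.9) map onto a standard optimizer setup. A sketch assuming PyTorch/torchvision; the model stand-in and the learning rate are placeholders, since neither is quoted here:

```python
import torch
from torchvision import transforms

# Resize so the shorter image side has 800 pixels (an int argument to
# Resize scales the smaller edge, preserving aspect ratio).
resize = transforms.Resize(800)

# Stand-in for the FPN-based detector (placeholder, not a real model).
model = torch.nn.Conv2d(3, 1, kernel_size=1)

# Synchronized SGD across 8 GPUs is a distributed-training detail; the
# quoted per-optimizer hyper-parameters are momentum and weight decay.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,            # placeholder: the learning rate is not quoted here
    momentum=0.9,       # "a momentum of 0.9"
    weight_decay=1e-4,  # "a weight decay of 0.0001"
)
```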
113 enrichment [ɪn'rɪtʃmənt]
  • How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. 自上而下的改进有多重要?表1(d)显示了没有自上而下路径的特征金字塔的结果。
114 par [pɑ:(r)]
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. 表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
115 lag [læg]
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. 表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
116 conjecture [kənˈdʒektʃə(r)]
  • We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. 我们推测这是因为自下而上的金字塔(图1(b))的不同层次之间存在较大的语义差距,尤其是对于非常深的ResNets。
117 variant [ˈveəriənt]
  • We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. 我们还评估了表1(d)的一个不共享头部参数的变体,但观察到类似的性能下降。
  • This variant (Table 1(f)) is better than the baseline but inferior to our approach. 这个变体(表1(f))比基线要好,但不如我们的方法。
  • Despite the good accuracy of this variant, it is based on the RPN proposals of ${P_k}$ and has thus already benefited from the pyramid representation. 尽管这个变体具有很好的准确性,但它是基于${P_k}$的RPN提议的,因此已经从金字塔表示中受益。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
118 level-specific [!≈ ˈlevl spəˈsɪfɪk]
  • This issue cannot be simply remedied by level-specific heads. 这个问题不能简单地通过特定级别的头部来解决。
119 downsampled
  • But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. 但是我们认为这些特征的位置并不精确,因为这些映射已经进行了多次下采样和上采样。
120 highest-resolution [!≈ haɪɪst ˌrezəˈlu:ʃn]
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids). 金字塔表示有多重要?可以不采用金字塔表示,而将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中最精细的层级)。
121 i.e. [ˌaɪ ˈi:]
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids). 金字塔表示有多重要?可以不采用金字塔表示,而将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中最精细的层级)。
122 orthogonal [ɔ:'θɒgənl]
  • It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a). 它得到了28.8的AP,表明2-fc头部没有给我们带来任何超过表2(a)中基线的正交优势。
123 sub-section ['sʌbs'ekʃn]
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN. 表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
124 noteworthy [ˈnəʊtwɜ:ði]
  • It is noteworthy that removing top-down connections (Table 2(d)) significantly degrades the accuracy, suggesting that Fast R-CNN suffers from using the low-level features at the high-resolution maps. 值得注意的是,去除自上而下的连接(表2(d))会显著降低准确性,这表明Fast R-CNN因在高分辨率映射上使用低级特征而受到影响。
125 warping-like [!≈ 'wɔ:pɪŋ laɪk]
  • We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales. 我们认为这是因为RoI池化是一种扭曲式的操作,对区域尺度较不敏感。
126 reproduction [ˌri:prəˈdʌkʃn]
  • Table 3(a) shows our reproduction of the baseline Faster R-CNN system as described in [16]. 表3(a)显示了我们再现[16]中描述的Faster R-CNN系统的基线。
127 convergence [kən'vɜ:dʒəns]
  • We find the following implementations contribute to the gap: (i) We use an image scale of 800 pixels instead of 600 in [11, 16]; (ii) We train with 512 RoIs per image which accelerate convergence, in contrast to 64 RoIs in [11, 16]; (iii) We use 5 scale anchors instead of 4 in [16] (adding $32^2$); (iv) At test time we use 1000 proposals per image instead of 300 in [16]. 我们发现以下实现有助于缩小差距:(i)我们使用800像素的图像尺度,而不是[11,16]中的600像素;(ii)与[11,16]中的64个ROI相比,我们训练时每张图像有512个ROIs,可以加速收敛;(iii)我们使用5个尺度的锚点,而不是[16]中的4个(添加$32^2$);(iv)在测试时,我们每张图像使用1000个提议,而不是[16]中的300个。
128 FPN-based
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101. 通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
129 NVIDIA [ɪn'vɪdɪə]
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101. 通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40. ^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
130 M40
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101. 通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40. ^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
131 leaderboard ['li:dərbɔ:d]
  • This model is the one we submitted to the COCO detection leaderboard, shown in Table 4. 该模型是我们提交给COCO检测排行榜的模型,如表4所示。
132 feature-sharing [!≈ ˈfi:tʃə(r) 'ʃeərɪŋ]
  • We have not evaluated its feature-sharing version due to limited time, which should be slightly better as implied by Table 5. 由于时间有限,我们尚未评估其特征共享版本,这应该稍微好一些,如表5所示。
133 test-dev [!≈ test dev]
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). 一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
  • On the test-dev set, our method increases over the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of AP@0.5 (59.1 vs. …). 在test-dev数据集上,我们的方法比现有的最佳结果提高了0.5个点的AP(36.2对35.7)和3.4个点的AP@0.5(59.1对…)。
134 Multipath ['mʌltɪpæθ]
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). 一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
135 AttractioNet
  • ^§: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result. ^§:AttractioNet[10]的这一条目采用VGG-16生成提议,用Wide ResNet[39]进行目标检测,因此严格来说它不是单模型结果。
136 G-RMI
  • Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors. 表4将我们方法的单模型结果与COCO竞赛获胜者的结果进行了比较,其中包括2016年冠军G-RMI和2015年冠军Faster R-CNN+++。没有添加任何额外的东西,我们的单模型条目就已经超越了这些强大的、经过精心设计的竞争对手。
137 small-scale [ˈsmɔ:lˈskeɪl]
  • It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects. 值得注意的是,我们的方法不依赖图像金字塔,只使用单个输入图像尺度,但在小型目标上仍然具有出色的AP。
138 iterative ['ɪtərətɪv]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further. 此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
139 mining [ˈmaɪnɪŋ]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further. 此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
140 augmentation [ˌɔ:ɡmen'teɪʃn]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further. 此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
141 complementary [ˌkɒmplɪˈmentri]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further. 此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
142 DeepMask
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28]. 在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores. DeepMask/SharpMask在裁剪图像上进行训练,可以预测实例块和目标/非目标分数。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16. DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation. 我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
143 convolutionally [!≈ kɒnvə'lu:ʃənəli]
  • At inference time, these models are run convolutionally to generate dense proposals in an image. 在推断时,这些模型是卷积运行的,以在图像中生成密集的提议。
144 setup ['setʌp]
  • We use a fully convolutional setup for both training and inference. 我们对训练和推断都使用全卷积设置。
145 Additionally [ə'dɪʃənəlɪ]
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. 此外,受[27,28]的图像金字塔中每个组(octave)使用2个尺度的启发,我们使用输入大小为7×7的第二个MLP来处理半个组。
146 Instance-FCN
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16. DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
147 zoom [zu:m]
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
148 runtime [rʌn'taɪm]
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40. ^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
149 InstanceFCN
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40. ^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation. 我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
150 K40
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40. ^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
151 Sharp-Mask [!≈ ʃɑ:p mɑ:sk]
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation. 我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
152 computationally [!≈ ˌkɒmpjuˈteɪʃənli]
  • Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by $2^{\lbrace -2:0.5:1 \rbrace}$ in [27, 28]), making them computationally expensive. 现有的掩码提议方法[27,28,4]基于密集采样的图像金字塔(例如,在[27,28]中按$2^{\lbrace -2:0.5:1 \rbrace}$缩放),使得它们的计算代价高昂。
153 substantially [səbˈstænʃəli]
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS). 我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。

Words List (frequency)

# word (frequency) phonetic sentence
1 FPN
(30)
[!≈ ef pi: en]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications.这种称为特征金字塔网络(FPN)的架构在几个应用程序中作为通用特征提取器表现出了显著的改进。
  • Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners.在一个基本的Faster R-CNN系统中使用FPN,没有任何不必要的东西,我们的方法就可以在COCO检测基准数据集上取得最先进的单模型结果,超过了所有现有的单模型条目,包括COCO 2016挑战赛获胜者的条目。
  • (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate.(d)我们提出的特征金字塔网络(FPN)与(b)和(c)类似,但更准确。
  • We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27].我们评估了我们称为特征金字塔网络(FPN)的方法,其在各种系统中用于检测和分割[11,29,27]。
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners.没有任何不必要的东西,我们仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准数据集[21]上报告了最先进的单模型结果,超过了竞赛获胜者所有现有的高度工程化的单模型条目。
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16].在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods.因此,FPN能够比所有现有的最先进方法获得更高的准确度。
  • We also generalize FPNs to instance segmentation proposals in Sec.6.在第6节中我们还将FPN泛化到实例分割提议。
  • We adapt RPN by replacing the single-scale feature map with our FPN.我们通过用我们的FPN替换单尺度特征映射来适应RPN。
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29].通过上述改编,RPN可以自然地通过我们的FPN进行训练和测试,与[29]中的方式相同。
  • To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.要将其与我们的FPN一起使用,我们需要为金字塔等级分配不同尺度的RoI。
  • Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.使用具有FPN的RPN在8个GPU上训练COCO数据集需要约8小时。
  • Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1 (c)), which is 8.0 points increase over the single-scale RPN baseline (Table 1 (a)).将FPN放在RPN中可将$AR^{1k}$提高到56.3(表1(c)),这比单尺度RPN基线(表1(a))增加了8.0个点。
  • As a result, FPN has an $AR^{1k}$ score 10 points higher than Table 1(e).因此,FPN的$AR^{1k}$得分比表1(e)高10个点。
  • Next we investigate FPN for region-based (non-sliding window) detectors.接下来我们研究基于区域(非滑动窗口)检测器的FPN。
  • Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.使用FPN在COCO数据集上训练Fast R-CNN需要约10小时。
  • To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals.为了更好地调查FPN对仅基于区域的检测器的影响,我们在一组固定的提议上进行Fast R-CNN的消融。
  • We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector.我们选择冻结RPN在FPN上计算的提议(表1(c)),因为它在能被检测器识别的小目标上具有良好的性能。
  • Table 2(c) shows the results of our FPN in Fast R-CNN.表2(c)显示了Fast R-CNN中我们的FPN结果。
  • Under controlled settings, our FPN (Table 3(c)) is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5.在受控的环境下,我们的FPN(表3(c))比这个强劲的基线要好2.3个点的AP和3.8个点的AP@0.5。
  • More object detection results using Faster R-CNN and our FPNs, evaluated on minival.使用Faster R-CNN和我们的FPN在minival上的更多目标检测结果。
  • Our method introduces small extra cost by the extra layers in the FPN, but has a lighter weight head.我们的方法通过FPN中的额外层引入了较小的额外成本,但具有更轻的头部。
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
  • Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation.最近,FPN在COCO竞赛的所有方面都取得了新的最佳结果,包括检测,实例分割和关键点估计。
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28].在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • It is easy to adapt FPN to generate mask proposals.改编FPN生成掩码提议很容易。
  • FPN for object segment proposals.目标分割提议的FPN。
  • Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4.我们的具有单个5×5MLP的基线FPN模型达到了43.4的AR。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16.DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS).我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。
2 RPN
(27)
[!≈ ɑ:(r) pi: en]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11].由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
  • In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection.在下面,我们采用我们的方法在RPN[29]中进行边界框提议生成,并在Fast R-CNN[11]中进行目标检测。
  • 4.1. Feature Pyramid Networks for RPN4.1. RPN的特征金字塔网络
  • RPN [29] is a sliding-window class-agnostic object detector.RPN[29]是一个滑动窗口类不可知的目标检测器。
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression.在原始的RPN设计中,一个小型子网络在单尺度卷积特征映射之上的密集3×3滑动窗口上进行评估,执行目标/非目标的二分类和边界框回归。
  • We adapt RPN by replacing the single-scale feature map with our FPN.我们通过用我们的FPN替换单尺度特征映射来适应RPN。
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29].通过上述改编,RPN可以自然地通过我们的FPN进行训练和测试,与[29]中的方式相同。
  • 5.1. Region Proposal with RPN5.1. 区域提议与RPN
  • For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored.对于所有的RPN实验(包括基线),我们在训练时都包含了图像外部的锚盒,这与[29]不同,在[29]中这些锚盒被忽略。
  • Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.使用具有FPN的RPN在8个GPU上训练COCO数据集需要约8小时。
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set.使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
  • For fair comparisons with original RPNs[29], we run two baselines (Table 1(a, b)) using the single-scale map of $C_4$ (the same as [16]) or $C_5$, both using the same hyper-parameters as ours, including using 5 scale anchors of $\lbrace 32^2, 64^2, 128^2, 256^2, 512^2 \rbrace$.为了与原始RPNs[29]进行公平比较,我们使用$C_4$(与[16]相同)或$C_5$的单尺度映射运行了两个基线(表1(a,b)),都使用与我们相同的超参数,包括使用5种尺度锚点$\lbrace 32^2, 64^2, 128^2, 256^2, 512^2 \rbrace$。
  • Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1 (c)), which is 8.0 points increase over the single-scale RPN baseline (Table 1 (a)).将FPN放在RPN中可将$AR^{1k}$提高到56.3(表1(c)),这比单尺度RPN基线(表1(a))增加了8.0个点。
  • Our pyramid representation greatly improves RPN’s robustness to object scale variation.我们的金字塔表示大大提高了RPN对目标尺度变化的鲁棒性。
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours.表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
  • We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector.我们选择冻结RPN在FPN上计算的提议(表1(c)),因为它在能被检测器识别的小目标上具有良好的性能。
  • For simplicity we do not share features between Fast R-CNN and RPN, except when specified.为了简单起见,我们不在Fast R-CNN和RPN之间共享特征,除非指定。
  • Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, ${P_k}$, Table 1(c)), evaluated on the COCO minival set.使用Fast R-CNN[11]在一组固定提议(RPN,${P_k}$,表1(c))上的目标检测结果,在COCO的minival数据集上进行评估。
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN.表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
  • Despite the good accuracy of this variant, it is based on the RPN proposals of ${P_k}$ and has thus already benefited from the pyramid representation.尽管这个变体具有很好的准确性,但它是基于${P_k}$的RPN提议的,因此已经从金字塔表示中受益。
  • But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible.但是在Faster R-CNN系统中[29],RPN和Fast R-CNN必须使用相同的骨干网络来实现特征共享。
  • Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN.表3显示了我们的方法和两个基线之间的比较,所有这些RPN和Fast R-CNN都使用一致的骨干架构。
  • The backbone network for RPN is consistent with Fast R-CNN.RPN与Fast R-CNN的骨干网络是一致的。
  • In the above, for simplicity we do not share the features between RPN and Fast R-CNN.在上面,为了简单起见,我们不共享RPN和Fast R-CNN之间的特征。
  • The two MLPs play a similar role as anchors in RPN.这两个MLP在RPN中扮演着类似于锚点的角色。
3 ConvNet
(17)
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid.(c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20].对于识别任务,工程特征大部分已经被深度卷积网络(ConvNets)[19,20]计算的特征所取代。
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)).除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape.深层ConvNet逐层计算特征层级,而对于下采样层,特征层级具有内在的多尺度金字塔形状。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)).单次检测器(SSD)[22]是首先尝试使用ConvNet的金字塔特征层级中的一个,好像它是一个特征化的图像金字塔(图1(c))。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
  • Deep ConvNet object detectors.Deep ConvNet目标检测器。
  • With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy.随着现代深度卷积网络[19]的发展,像OverFeat[34]和R-CNN[12]这样的目标检测器在精度上显示出了显著的提高。
  • OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid.OverFeat采用了一种类似于早期神经网络人脸检测器的策略,通过在图像金字塔上应用ConvNet作为滑动窗口检测器。
  • R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet.R-CNN采用了基于区域提议的策略[37],其中每个提议在用ConvNet进行分类之前都进行了尺度归一化。
  • A number of recent approaches improve detection and segmentation by using different layers in a ConvNet.一些最近的方法通过使用ConvNet中的不同层来改进检测和分割。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout.我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2.自下而上的路径是骨干ConvNet的前馈计算,其计算由尺度步长为2的多尺度特征映射组成的特征层级。
  • Our method is a generic solution for building feature pyramids inside deep ConvNets.我们的方法是在深度ConvNets内部构建特征金字塔的通用解决方案。
  • We have presented a clean and simple framework for building feature pyramids inside ConvNets.我们提出了一个干净而简单的框架,用于在ConvNets内部构建特征金字塔。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
4 lateral
(15)
[ˈlætərəl]
  • A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales.开发了一种具有横向连接的自顶向下架构,用于在所有尺度上构建高级语义特征映射。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)).为了实现这个目标,我们所依赖的架构将低分辨率、强语义的特征与高分辨率、弱语义的特征通过自顶向下的路径和横向连接相结合。(图1(d))。
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.如下所述,我们的金字塔结构包括自下而上的路径,自上而下的路径和横向连接。
  • Top-down pathway and lateral connections.自顶向下的路径和横向连接。
  • These features are then enhanced with features from the bottom-up pathway via lateral connections.这些特征随后通过来自自下而上路径上的特征经由横向连接进行增强。
  • Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway.每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。
  • A building block illustrating the lateral connection and the top-down pathway, merged by addition.构建模块说明了横向连接和自顶向下路径,通过加法合并。
  • The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively.列“lateral”和“top-down”分别表示横向连接和自上而下连接的存在。
  • With this modification, the 1×1 lateral connections followed by 3×3 convolutions are attached to the bottom-up pyramid.通过这种修改,将1×1横向连接和后面的3×3卷积添加到自下而上的金字塔中。
  • How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections.横向连接有多重要?表1(e)显示了没有1×1横向连接的自顶向下特征金字塔的消融结果。
  • More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps.更精确的特征位置可以通过横向连接直接从自下而上映射的更精细层级传递到自上而下的映射。
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN.表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
5 semantic
(15)
[sɪˈmæntɪk]
  • A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales.开发了一种具有横向连接的自顶向下架构,用于在所有尺度上构建高级语义特征映射。
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)).除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths.这种网内特征层级产生不同空间分辨率的特征映射,但引入了由不同深度引起的较大的语义差异。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale.其结果是一个特征金字塔,在所有级别都具有丰富的语义,并且可以从单个输入图像尺度上进行快速构建。
  • FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations.FCN[24]将多个尺度上的每个类别的部分分数相加以计算语义分割。
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout.我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
  • The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels.共享参数的良好性能表明我们的金字塔的所有层级共享相似的语义级别。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.表1(b)显示没有优于(a),这表明单个更高级别的特征映射是不够的,因为存在在较粗分辨率和较强语义之间的权衡。
  • We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets.我们推测这是因为自下而上的金字塔(图1(b))的不同层次之间存在较大的语义差距,尤其是对于非常深的ResNets。
  • This top-down pyramid has strong semantic features and fine resolutions.这个自顶向下的金字塔具有强大的语义特征和良好的分辨率。
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids).金字塔表示有多重要?可以不采用金字塔表示,而将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中最精细的层级)。
6 roi
(13)
[rwɑ:]
  • Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features.Fast R-CNN[11]是一个基于区域的目标检测器,利用感兴趣区域(RoI)池化来提取特征。
  • To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.要将其与我们的FPN一起使用,我们需要为金字塔等级分配不同尺度的RoI。
  • Formally, we assign an RoI of width $w$ and height $h$ (on the input image to the network) to the level $P_k$ of our feature pyramid by: $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$.在形式上,我们通过公式$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$将宽度为$w$、高度为$h$(在网络的输入图像上)的RoI分配到特征金字塔的级别$P_k$。
  • Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level on which an RoI with $w\times h=224^2$ should be mapped into.这里224是规范的ImageNet预训练大小,而$k_0$是大小为$w \times h=224^2$的RoI应该映射到的目标级别。
  • Intuitively, Eqn. (1) means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k=3$).直觉上,方程(1)意味着如果RoI的尺寸变小了(比如224的1/2),它应该被映射到一个更精细的分辨率级别(比如$k=3$)。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels.我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
  • So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers.因此,与[16]不同,我们只是采用RoI池化提取7×7特征,并在最终的分类层和边界框回归层之前附加两个隐藏单元为1024维的全连接(fc)层(每层后都接ReLU层)。
  • Each mini-batch involves 2 images per GPU and 512 RoIs per image.每个小批量数据包括每个GPU上2张图像和每张图像上512个RoI。
  • We use 2000 RoIs per image for training and 1000 for testing.我们每张图像使用2000个RoIs进行训练,1000个RoI进行测试。
  • As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head.作为基于ResNet的Fast R-CNN基线,遵循[16],我们采用输出尺寸为14×14的RoI池化,并将所有conv5层作为头部的隐藏层。
  • We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales.我们认为这是因为RoI池化是一种扭曲式的操作,对区域尺度较不敏感。
  • We find the following implementations contribute to the gap: (i) We use an image scale of 800 pixels instead of 600 in [11, 16]; (ii) We train with 512 RoIs per image which accelerate convergence, in contrast to 64 RoIs in [11, 16]; (iii) We use 5 scale anchors instead of 4 in [16] (adding $32^2$); (iv) At test time we use 1000 proposals per image instead of 300 in [16].我们发现以下实现有助于缩小差距:(i)我们使用800像素的图像尺度,而不是[11,16]中的600像素;(ii)与[11,16]中的64个ROI相比,我们训练时每张图像有512个ROIs,可以加速收敛;(iii)我们使用5个尺度的锚点,而不是[16]中的4个(添加$32^2$);(iv)在测试时,我们每张图像使用1000个提议,而不是[16]中的300个。
7 featurize
(12)
['fi:tʃәraiz]
  • Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)).建立在图像金字塔之上的特征金字塔(我们简称为特征化图像金字塔)构成了标准解决方案的基础[1](图1(a))。
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid.(c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25].特征化图像金字塔在手工设计的时代被大量使用[5,25]。
  • All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]).在ImageNet[33]和COCO[21]检测挑战中,最近所有排名靠前的条目都在特征化图像金字塔上使用了多尺度测试(例如[16,35])。
  • For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.出于这些原因,Fast和Faster R-CNN[11,29]选择在默认设置下不使用特征化图像金字塔。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)).单次检测器(SSD)[22]是首先尝试使用ConvNet的金字塔特征层级中的一个,好像它是一个特征化的图像金字塔(图1(c))。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
  • Our model echoes a featurized image pyramid, which has not been explored in these works.我们的模型反映了一个特征化的图像金字塔,这在这些研究中还没有探索过。
  • There has also been significant interest in computing featurized image pyramids quickly.这对快速计算特征化图像金字塔也很有意义。
  • Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2.尽管这些方法采用的是金字塔形状的架构,但它们不同于特征化的图像金字塔[5,7,34],其中所有层次上的预测都是独立进行的,参见图2。
  • Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps.由于金字塔的所有层都像传统的特征图像金字塔一样使用共享分类器/回归器,因此我们在所有特征映射中固定特征维度(通道数记为d)。
  • This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.这个优点类似于使用特征化图像金字塔的优点,其中共用的头部分类器可以应用于在任何图像尺度下计算的特征。
8 pathway
(12)
[ˈpɑ:θweɪ]
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)).为了实现这个目标,我们所依赖的架构将低分辨率、强语义的特征与高分辨率、弱语义的特征通过自顶向下的路径和横向连接相结合。(图1(d))。
  • The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.如下所述,我们的金字塔结构包括自下而上的路径,自上而下的路径和横向连接。
  • Bottom-up pathway.自下而上的路径。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2.自下而上的路径是骨干ConvNet的前馈计算,其计算由尺度步长为2的多尺度特征映射组成的特征层级。
  • Top-down pathway and lateral connections.自顶向下的路径和横向连接。
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
  • These features are then enhanced with features from the bottom-up pathway via lateral connections.这些特征随后通过来自自下而上路径上的特征经由横向连接进行增强。
  • Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway.每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。
  • A building block illustrating the lateral connection and the top-down pathway, merged by addition.构建模块说明了横向连接和自顶向下路径,通过加法合并。
  • How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway.自上而下的改进有多重要?表1(d)显示了没有自上而下路径的特征金字塔的结果。
9 MLP
(10)
[!≈ em el pi:]
  • Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.请注意,与标准的conv5头部相比,我们的2-fc MLP头部更轻更快。
  • Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture.表2(b)是利用MLP头部的基线,其具有2个隐藏的fc层,类似于我们的架构中的头部。
  • On top of each level of the feature pyramid, we apply a small 5×5 MLP to predict 14×14 masks and object scores in a fully convolutional fashion, see Fig. 4.在特征金字塔的每个层级上,我们应用一个小的5×5MLP以全卷积方式预测14×14掩码和目标分数,参见图4。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves.此外,受[27,28]的图像金字塔中每个组(octave)使用2个尺度的启发,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • The two MLPs play a similar role as anchors in RPN.这两个MLP在RPN中扮演着类似于锚点的角色。
  • We apply a small MLP on 5x5 windows to generate dense object segments with output dimension of 14x14.我们在5x5窗口上应用一个小的MLP来生成输出尺寸为14x14的密集目标块。
  • Half octaves are handled by an MLP on 7x7 windows ($7 \approx 5\sqrt{2}$), not shown here.半个组由MLP在7x7窗口($7 \approx 5\sqrt{2}$)上处理,此处未展示。
  • Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4.我们的具有单个5×5MLP的基线FPN模型达到了43.4的AR。
  • Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged.切换到稍大的7×7MLP,精度基本保持不变。
  • Using both MLPs together increases accuracy to 45.7 AR.同时使用两个MLP将精度提高到了45.7的AR。
10 pyramidal
(9)
['pɪrəmɪdl]
  • In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.在本文中,我们利用深度卷积网络内在的多尺度、金字塔分级来构造具有很少额外成本的特征金字塔。
  • (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid.(c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape.深层ConvNet逐层计算特征层级,而对于下采样层,特征层级具有内在的多尺度金字塔形状。
  • The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)).单次检测器(SSD)[22]是最早尝试将ConvNet的金字塔特征层级当作特征化图像金字塔来使用的方法之一(图1(c))。
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2.尽管这些方法采用的是金字塔形状的架构,但它们不同于特征化的图像金字塔[5,7,34],其中所有层次上的预测都是独立进行的,参见图2。
  • In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].事实上,对于图2(顶部)中的金字塔结构,仍然需要图像金字塔来跨多个尺度识别目标[28]。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout.我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
  • This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)).该架构模拟了重用金字塔特征层次结构的效果(图1(b))。
11 e.g.
(9)
[ˌi: ˈdʒi:]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave).它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
  • All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]).在ImageNet[33]和COCO[21]检测挑战中,最近所有排名靠前的参赛结果都在特征化图像金字塔上使用了多尺度测试(例如[16,35])。
  • Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications.推断时间显著增加(例如,四倍[11]),使得这种方法在实际应用中不切实际。
  • But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers.但为了避免使用低级特征,SSD放弃重用已经计算好的层,而是从网络中较高的层开始构建金字塔(例如,VGG网络的conv4_3[36]),然后再添加几个新层。
  • On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom).相反,我们的方法利用这个架构作为特征金字塔,其中预测(例如目标检测)在每个级别上独立进行(图2底部)。
  • Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]).顶部:带有跳跃连接的自顶向下的架构,在最好的级别上进行预测(例如,[28])。
  • This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16].这个过程独立于主卷积体系结构(例如[19,36,16]),在本文中,我们呈现了使用ResNets[16]的结果。
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results.我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
  • Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by 2^{\lbrace −2:0.5:1 \rbrace} in [27, 28]), making them computationally expensive.现有的掩码提议方法[27,28,4]基于密集采样的图像金字塔(例如,在[27,28]中按2^{\lbrace −2:0.5:1 \rbrace}缩放),使得它们的计算代价很高。
12 bounding
(8)
[baundɪŋ]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16].在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection.在下面,我们采用我们的方法在RPN[29]中进行边界框提议生成,并在Fast R-CNN[11]中进行目标检测。
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression.在原始的RPN设计中,一个小型子网络在单尺度卷积特征映射之上的密集3×3滑动窗口上进行评估,执行目标/非目标二分类和边界框回归。
  • The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29].目标/非目标准则和边界框回归目标是相对于一组称为锚点(anchors)的参考框定义的[29]。
  • We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29].如[29],我们根据锚点和实际边界框的交并比(IoU)比例将训练标签分配给锚点。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels.我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
  • So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers.因此,与[16]不同,我们只是采用RoI池化提取7×7特征,并在最终的分类层和边界框回归层之前附加两个隐藏单元为1024维的全连接(fc)层(每层后都接ReLU层)。
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set.使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
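The RPN excerpts above assign anchor labels from Intersection-over-Union (IoU) with ground-truth boxes. A small self-contained sketch of that IoU computation, using corner-format boxes; the 0.7/0.3 thresholds in the comment are the common Faster R-CNN convention rather than something stated in these excerpts:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Typical RPN labeling: positive if IoU > 0.7 with some ground-truth box,
# negative if IoU < 0.3 with all of them, ignored otherwise.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142857...
```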
13 minival
(7)
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival).我们训练使用80k张训练图像和35k大小的验证图像子集(trainval35k[2])的联合,并报告了在5k大小的验证图像子集(minival)上的消融实验。
  • Bounding box proposal results using RPN [29], evaluated on the COCO minival set.使用RPN[29]的边界框提议结果,在COCO的minival数据集上进行评估。
  • Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, ${P_k}$, Table 1(c)), evaluated on the COCO minival set.使用Fast R-CNN[11]在一组固定提议(RPN,${P_k}$,表1(c))上的目标检测结果,在COCO的minival数据集上进行评估。
  • Object detection results using Faster R-CNN [29] evaluated on the COCO minival set.使用Faster R-CNN[29]在COCO minival数据集上评估的目标检测结果。
  • More object detection results using Faster R-CNN and our FPNs, evaluated on minival.使用Faster R-CNN和我们的FPN在minival上的更多目标检测结果。
  • This increases AP on minival to 35.6, without sharing features.这将minival上的AP增加到了35.6,没有共享特征。
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival).一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
14 backbone
(6)
[ˈbækbəʊn]
  • This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16].这个过程独立于主卷积体系结构(例如[19,36,16]),在本文中,我们呈现了使用ResNets[16]的结果。
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2.自下而上的路径是骨干ConvNet的前馈计算,它计算出由多个尺度的特征映射组成的特征层级,相邻尺度间的步长为2。
  • As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset.正如通常的做法[12],所有的网络骨干都是在ImageNet1k分类集[33]上预先训练好的,然后在检测数据集上进行微调。
  • But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible.但是在Faster R-CNN系统中[29],RPN和Fast R-CNN必须使用相同的骨干网络来实现特征共享。
  • Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN.表3显示了我们的方法和两个基线之间的比较,所有这些RPN和Fast R-CNN都使用一致的骨干架构。
  • The backbone network for RPN are consistent with Fast R-CNN.RPN与Fast R-CNN的骨干网络是一致的。
15 semantically
(5)
[sɪ'mæntɪklɪ]
  • In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.在该图中,特征映射用蓝色轮廓表示,较粗的轮廓表示语义上较强的特征。
  • The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.对图像金字塔的每个层次进行特征化的主要优势在于,它产生了一个多尺度特征表示,其中所有层次(包括高分辨率层)在语义上都很强。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)).为了实现这个目标,我们依赖这样一种架构:通过自顶向下的路径和横向连接,将低分辨率、强语义的特征与高分辨率、弱语义的特征相结合(图1(d))。
  • To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)).为了实现这个目标,我们依赖这样一种架构:通过自顶向下的路径和横向连接,将低分辨率、强语义的特征与高分辨率、弱语义的特征相结合(图1(d))。
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
16 ablation
(5)
[əˈbleɪʃn]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16].在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival).我们训练使用80k张训练图像和35k大小的验证图像子集(trainval35k[2])的联合,并报告了在5k大小的验证图像子集(minival)上的消融实验。
  • 5.1.1 Ablation Experiments5.1.1 消融实验
  • How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections.横向连接有多重要?表1(e)显示了没有1×1横向连接的自顶向下特征金字塔的消融结果。
  • To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals.为了更好地调查FPN对仅基于区域的检测器的影响,我们在一组固定的提议上进行Fast R-CNN的消融。
17 SharpMask
(5)
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28].在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores.DeepMask/SharpMask在裁剪图像上进行训练,可以预测实例块和目标/非目标分数。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16.DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants).DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
18 DeepMask
(5)
  • In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28].在本节中,我们使用FPN生成分割建议,遵循DeepMask/SharpMask框架[27,28]。
  • DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores.DeepMask/SharpMask在裁剪图像上进行训练,可以预测实例块和目标/非目标分数。
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16.DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants).DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation.我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
19 generic
(4)
[dʒəˈnerɪk]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications.这种称为特征金字塔网络(FPN)的架构在几个应用程序中作为通用特征提取器表现出了显著的改进。
  • Our method is a generic solution for building feature pyramids inside deep ConvNets.我们的方法是在深度ConvNets内部构建特征金字塔的通用解决方案。
  • Our method is a generic pyramid representation and can be used in applications other than object detection.我们的方法是一种通用金字塔表示,可用于除目标检测之外的其他应用。
  • These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.这些结果表明,我们的模型是一个通用的特征提取器,可以替代图像金字塔以用于其他多尺度检测问题。
20 octave
(4)
[ˈɒktɪv]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave).它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves.此外,由于在[27,28]的图像金字塔中每组使用2个尺度,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves.此外,由于在[27,28]的图像金字塔中每组使用2个尺度,我们使用输入大小为7×7的第二个MLP来处理半个组。
  • Half octaves are handled by an MLP on 7x7 windows ($7 \approx 5 \sqrt 2$), not shown here.半个组由MLP在7x7窗口($7 \approx 5\sqrt{2}$)上处理,此处未展示。
21 robustness
(4)
[rəʊ'bʌstnəs]
  • But even with this robustness, pyramids are still needed to get the most accurate results.但即使有这种鲁棒性,仍然需要金字塔才能获得最准确的结果。
  • Our pyramid representation greatly improves RPN’s robustness to object scale variation.我们的金字塔表示大大提高了RPN对目标尺度变化的鲁棒性。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
22 leverage
(4)
[ˈli:vərɪdʒ]
  • The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。
  • On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom).相反,我们的方法利用这个架构作为特征金字塔,其中预测(例如目标检测)在每个级别上独立进行(图2底部)。
  • Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.底部:我们的模型具有类似的结构,但将其用作特征金字塔,并在各个层级上独立进行预测。
  • Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout.我们的目标是利用ConvNet的金字塔特征层级,该层次结构具有从低到高的语义,并在整个过程中构建具有高级语义的特征金字塔。
23 SIFT
(4)
[sɪft]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching.SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
  • HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids.HOG特征[5],以及后来的SIFT特征,都是在整个图像金字塔上密集计算的。
  • These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more.这些HOG和SIFT金字塔已在许多工作中得到了应用,用于图像分类,目标检测,人体姿势估计等。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
24 trainval35k
(4)
  • We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival).我们训练使用80k张训练图像和35k大小的验证图像子集(trainval35k[2])的联合,并报告了在5k大小的验证图像子集(minival)上的消融实验。
  • All models are trained on trainval35k.所有模型都是通过trainval35k训练的。
  • Models are trained on the trainval35k set.模型在trainval35k数据集上训练。
  • Models are trained on the trainval35k set and use ResNet-50. ^†Provided by authors of [16].模型在trainval35k数据集上训练并使用ResNet-50。^†由[16]的作者提供。
25 variant
(4)
[ˈveəriənt]
  • We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance.我们还评估了表1(d)的一个不共享头部参数的变体,但观察到类似的性能下降。
  • This variant (Table 1(f)) is better than the baseline but inferior to our approach.这个变体(表1(f))比基线要好,但不如我们的方法。
  • Despite the good accuracy of this variant, it is based on the RPN proposals of ${P_k}$ and has thus already benefited from the pyramid representation.尽管这个变体具有很好的准确性,但它是基于${P_k}$的RPN提议的,因此已经从金字塔表示中受益。
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants).DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
26 surpass
(3)
[səˈpɑ:s]
  • Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners.在一个基本的Faster R-CNN系统中使用FPN,没有任何不必要的东西,我们的方法就在COCO检测基准数据集上取得了最先进的单模型结果,超过了所有现有的单模型参赛结果,包括COCO 2016挑战赛获奖者的结果。
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners.没有任何不必要的东西,我们仅仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准数据集[21]上报告了最先进的单模型结果,超过了竞赛获奖者所有现存的高度工程化的单模型参赛结果。
  • Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors.表4将我们方法的单模型结果与COCO竞赛获胜者的结果进行了比较,其中包括2016年冠军G-RMI和2015年冠军Faster R-CNN+++。没有添加额外的东西,我们的单模型提交就已经超越了这些强大的、经过严格设计的竞争对手。
27 representational
(3)
[ˌreprɪzenˈteɪʃnl]
  • The high-resolution maps have low-level features that harm their representational capacity for object recognition.高分辨率映射具有损害其目标识别表示能力的低级特征。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
  • Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.最后,我们的研究表明,尽管深层ConvNets具有强大的表示能力以及它们对尺度变化的隐式鲁棒性,但使用金字塔表示对于明确地解决多尺度问题仍然至关重要。
28 COCO-style
(3)
[!≈ 'kəʊkəʊ staɪl]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16].在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects ($AR_s$, $AR_m$, and $AR_l$) following the definitions in [21].根据[21]中的定义,我们评估了COCO类型的平均召回率(AR)以及在小、中、大型目标上的AR($AR_s$、$AR_m$和$AR_l$)。
  • We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5).我们通过COCO类型的平均精度(AP)和PASCAL类型的AP(单个IoU阈值为0.5)来评估目标检测。
29 HOG
(3)
[hɒg]
  • HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids.HOG特征[5],以及后来的SIFT特征,都是在整个图像金字塔上密集计算的。
  • These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more.这些HOG和SIFT金字塔已在许多工作中得到了应用,用于图像分类,目标检测,人体姿势估计等。
  • Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.在HOG和SIFT之前,使用ConvNet[38,32]的早期人脸检测工作计算了图像金字塔上的浅网络,以检测跨尺度的人脸。
30 residual
(3)
[rɪˈzɪdjuəl]
  • Specifically, for ResNets [16] we use the feature activations output by each stage’s last residual block.具体而言,对于ResNets[16],我们使用每个阶段的最后一个残差块输出的特征激活。
  • We denote the output of these last residual blocks as $\lbrace C_2 , C_3 , C_4 , C_5 \rbrace$ for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image.对于conv2,conv3,conv4和conv5输出,我们将这些最后残差块的输出表示为$\lbrace C_2, C_3, C_4, C_5 \rbrace$,并注意相对于输入图像它们的步长为{4,8,16,32}个像素。
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results.我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
31 extractor
(2)
[ɪkˈstræktə(r)]
  • This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications.这种称为特征金字塔网络(FPN)的架构在几个应用程序中作为通用特征提取器表现出了显著的改进。
  • These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.这些结果表明,我们的模型是一个通用的特征提取器,可以替代图像金字塔以用于其他多尺度检测问题。
32 FPS
(2)
['efp'i:'es]
  • In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection.此外,我们的方法可以在GPU上以6FPS运行,因此是多尺度目标检测的实用和准确的解决方案。
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS).我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。
33 hand-engineered
(2)
[!≈ hænd 'endʒɪn'ɪərd]
  • Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25].特征化图像金字塔在手工设计的时代被大量使用[5,25]。
  • Hand-engineered features and early neural networks.手工设计特征和早期神经网络。
34 higher-level
(2)
[!≈ ˈhaɪə(r) ˈlevl]
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)).除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.表1(b)显示没有优于(a),这表明单个更高级别的特征映射是不够的,因为存在在较粗分辨率和较强语义之间的权衡。
35 variance
(2)
[ˈveəriəns]
  • Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)).除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。
  • RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔层级上扫描可以增加其对尺度变化的鲁棒性。
36 featurizing
(2)
  • The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.对图像金字塔的每个层次进行特征化的主要优势在于,它产生了一个多尺度特征表示,其中所有层次(包括高分辨率层)在语义上都很强。
  • Nevertheless, featurizing each level of an image pyramid has obvious limitations.尽管如此,特征化图像金字塔的每个层次都具有明显的局限性。
37 in-network
(2)
[!≈ ɪn ˈnetwɜ:k]
  • This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths.这种网内特征层级产生不同空间分辨率的特征映射,但引入了由不同深度引起的较大的语义差异。
  • In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。
38 PASCAL-style
(2)
[!≈ 'pæskәl staɪl]
  • In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16].在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。
  • We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5).我们通过COCO类型的平均精度(AP)和PASCAL类型的AP(单个IoU阈值为0.5)来评估目标检测。
39 OverFeat
(2)
  • With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy.随着现代深度卷积网络[19]的发展,像OverFeat[34]和R-CNN[12]这样的目标检测器在精度上显示出了显著的提高。
  • OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid.OverFeat采用了一种类似于早期神经网络人脸检测器的策略,通过在图像金字塔上应用ConvNet作为滑动窗口检测器。
40 trade-off
(2)
[ˈtreɪdˌɔ:f, -ˌɔf]
  • Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed.最近更准确的检测方法,如Fast R-CNN[11]和Faster R-CNN[29]提倡使用从单一尺度计算出的特征,因为它提供了精确度和速度之间的良好折衷。
  • Table 1 (b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.表1(b)显示没有优于(a),这表明单个更高级别的特征映射是不够的,因为存在在较粗分辨率和较强语义之间的权衡。
41 FCN
(2)
[!≈ ef si: en]
  • FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations.FCN[24]将多个尺度上的每个类别的部分分数相加以计算语义分割。
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation.Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
42 keypoint
(2)
[ki:'pɔɪnt]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
  • Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation.最近,FPN在COCO竞赛的所有方面都取得了新的最佳结果,包括检测,实例分割和关键点估计。
43 upsampled
(2)
  • The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition.然后通过按元素相加,将上采样映射与相应的自下而上映射(其经过1×1卷积层来减少通道维度)合并。
  • But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times.但是我们认为这些特征的位置并不精确,因为这些映射已经进行了多次下采样和上采样。
44 regressor
(2)
[rɪ'gresə(r)]
  • Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps.由于金字塔的所有层都像传统的特征图像金字塔一样使用共享分类器/回归器,因此我们在所有特征映射中固定特征维度(通道数记为d)。
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels.我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
45 marginally
(2)
[ˈmɑ:dʒɪnəli]
  • We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results.我们已经尝试了更复杂的块(例如,使用多层残差块[16]作为连接)并观察到稍微更好的结果。
  • Its result (33.4 AP) is marginally worse than that of using all pyramid levels (33.9 AP, Table 2(c)).其结果(33.4 AP)略低于使用所有金字塔等级(33.9 AP,表2(c))的结果。
46 subnetwork
(2)
  • In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression.在原始的RPN设计中,一个小型子网络在单尺度卷积特征映射之上的密集3×3滑动窗口上进行评估,执行目标/非目标二分类和边界框回归。
  • In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid.在[16]中,ResNet的conv5层(9层深的子网络)被用作conv4特征之上的头部,但我们的方法已经利用了conv5来构建特征金字塔。
47 analogous
(2)
[əˈnæləgəs]
  • This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.这个优点类似于使用特征图像金字塔的优点,其中可以将常见头部分类器应用于在任何图像尺度下计算的特征。
  • Analogous to the ResNet-based Faster R-CNN system [16] that uses $C_4$ as the single-scale feature map, we set $k_0$ to 4.类似于使用$C_4$作为单尺度特征映射的基于ResNet的Faster R-CNN系统[16],我们将$k_0$设置为4。
48 adaptation
(2)
[ˌædæpˈteɪʃn]
  • With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29].通过上述调整,RPN可以自然地用我们的FPN进行训练和测试,方式与[29]中相同。
  • Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid.基于这些调整,我们可以在特征金字塔之上训练和测试Fast R-CNN。
49 canonical
(2)
[kəˈnɒnɪkl]
  • Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level on which an RoI with $w\times h=224^2$ should be mapped into.这里224是规范的ImageNet预训练大小,而$k_0$是大小为$w \times h=224^2$的RoI应该映射到的目标级别。
  • Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown.显示了相应的图像区域大小(浅橙色)和典型目标大小(深橙色)。
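The two "canonical" excerpts refer to the paper's RoI-to-level assignment, Eqn. (1): $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$. A short sketch; clipping to levels 2 through 5 reflects the $\lbrace P_2, ..., P_5 \rbrace$ pyramid used by the detector and is an assumption of this snippet:

```python
import math

def roi_level(w, h, k0=4, canonical=224.0, k_min=2, k_max=5):
    """Eqn. (1): k = floor(k0 + log2(sqrt(w*h) / 224)),
    clipped to the available pyramid levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)

print(roi_level(224, 224))  # 4: a canonical-size RoI maps to k0
print(roi_level(112, 112))  # 3: half the scale maps one level finer
```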
50 test-std
(2)
  • We also report final results on the standard test set (test-std) [21] which has no disclosed labels.我们还报告了在没有公开标签的标准测试集(test-std)[21]上的最终结果。
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival).一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
51 resize
(2)
[ˌri:ˈsaɪz]
  • The input image is resized such that its shorter side has 800 pixels.输入图像的大小调整为其较短边有800像素。
  • The input image is resized such that its shorter side has 800 pixels.输入图像的大小调整为其较短边有800像素。
52 synchronize
(2)
[ˈsɪŋkrənaɪz]
  • We adopt synchronized SGD training on 8 GPUs.我们采用8个GPU进行同步SGD训练。
  • Synchronized SGD is used to train the model on 8 GPUs.同步SGD用于在8个GPU上训练模型。
53 SGD
(2)
['esdʒ'i:d'i:]
  • We adopt synchronized SGD training on 8 GPUs.我们采用8个GPU进行同步SGD训练。
  • Synchronized SGD is used to train the model on 8 GPUs.同步SGD用于在8个GPU上训练模型。
54 momentum
(2)
[məˈmentəm]
  • We use a weight decay of 0.0001 and a momentum of 0.9.我们使用0.0001的权重衰减和0.9的动量。
  • We use a weight decay of 0.0001 and a momentum of 0.9.我们使用0.0001的权重衰减和0.9的动量。
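For reference, the quoted hyper-parameters map directly onto a standard SGD configuration. A hedged PyTorch sketch; the model and the learning rate are placeholders, not the paper's full schedule:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # stand-in model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,              # illustrative; set per the actual schedule
    momentum=0.9,         # as quoted above
    weight_decay=0.0001,  # as quoted above
)
```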
55 NVIDIA
(2)
[ɪn'vɪdɪə]
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101.通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
56 M40
(2)
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101.通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
57 test-dev
(2)
[!≈ test dev]
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival).一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
  • On the test-dev set, our method increases over the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of AP@0.5 (59.1 vs. 55.7).在test-dev数据集上,我们的方法比现有的最佳结果提高了0.5个百分点的AP(36.2对35.7)和3.4个百分点的AP@0.5(59.1对55.7)。
58 InstanceFCN
(2)
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation.我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
59 scale-invariant
(1)
[!≈ skeɪl ɪnˈveəriənt]
  • These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid.这些金字塔是尺度不变的,因为目标的尺度变化是通过在金字塔中移动它的层级来抵消的。
60 DPM
(1)
[!≈ di: pi: em]
  • They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave).它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave含义参考SIFT特征)。
61 impractical
(1)
[ɪmˈpræktɪkl]
  • Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications.推断时间显著增加(例如,四倍[11]),使得这种方法在实际应用中不切实际。
62 infeasible
(1)
[ɪn'fi:zəbl]
  • Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference.此外,在图像金字塔上端对端地训练深度网络在内存方面是不可行的,所以如果被采用,图像金字塔仅在测试时被使用[15,11,16,35],这造成了训练/测试时推断的不一致性。
63 inconsistency
(1)
[ˌɪnkən'sɪstənsɪ]
  • Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference.此外,在图像金字塔上端对端地训练深度网络在内存方面是不可行的,所以如果被采用,图像金字塔仅在测试时被使用[15,11,16,35],这造成了训练/测试时推断的不一致性。
64 subsampling
(1)
  • A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape.深层ConvNet逐层计算特征层级,而对于下采样层,特征层级具有内在的多尺度金字塔形状。
65 Ideally
(1)
[aɪ'di:əlɪ]
  • Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost.理想情况下,SSD风格的金字塔将重用正向传递中从不同层中计算的多尺度特征映射,因此是零成本的。
66 SSD-style
(1)
  • Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost.理想情况下,SSD风格的金字塔将重用正向传递中从不同层中计算的多尺度特征映射,因此是零成本的。
67 forego
(1)
[fɔ:ˈɡəu]
  • But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers.但为了避免使用低级特征,SSD放弃重用已经计算好的层,而是从网络中较高的层开始构建金字塔(例如,VGG网络的conv4_3[36]),然后再添加几个新层。
68 higher-resolution
(1)
[!≈ ˈhaɪə(r) ˌrezəˈlu:ʃn]
  • Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy.因此它错过了重用特征层级的更高分辨率映射的机会。
69 heavily-engineered
(1)
[!≈ ˈhevɪli 'endʒɪn'ɪərd]
  • Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners.没有任何不必要的东西,我们仅仅基于FPN和基本的Faster R-CNN检测器[29],就在具有挑战性的COCO检测基准数据集[21]上报告了最先进的单模型结果,超过了竞赛获奖者所有现存的高度工程化的单模型参赛结果。
70 consistently
(1)
[kən'sɪstəntlɪ]
  • In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids.另外,我们的金字塔结构可以在所有尺度上进行端到端训练,并且在训练/测试时一致地使用,而这用图像金字塔在内存上是不可行的。
71 memory-infeasible
(1)
[!≈ ˈmeməri ɪn'fi:zəbl]
  • In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids.另外,我们的金字塔结构可以在所有尺度上进行端到端训练,并且在训练/测试时一致地使用,而这用图像金字塔在内存上是不可行的。
72 scale-space
(1)
[!≈ skeɪl speɪs]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching.SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
73 extrema
(1)
[ɪks'tri:mə]
  • SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching.SIFT特征[25]最初是从尺度空间极值中提取的,用于特征点匹配。
74 sparsely
(1)
[spɑ:slɪ]
  • Dollar et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels.Dollar等人[6]通过先计算一个稀疏采样(尺度)金字塔,然后插入缺失的层级,从而演示了快速金字塔计算。
75 interpolate
(1)
[ɪnˈtɜ:pəleɪt]
  • Dollar et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels.Dollar等人[6]通过先计算一个稀疏采样(尺度)金字塔,然后插入缺失的层级,从而演示了快速金字塔计算。
76 scale-normalized
(1)
[!≈ skeɪl 'nɔ:məlaɪzd]
  • R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet.R-CNN采用了基于区域提议的策略[37],其中每个提议在用ConvNet进行分类之前都进行了尺度归一化。
77 SPPnet
(1)
  • SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale.SPPnet[15]表明,这种基于区域的检测器可以更有效地应用于在单个图像尺度上提取的特征映射。
78 hypercolumn
(1)
[haɪpə'kɒləm]
  • Hypercolumns [13] uses a similar method for object instance segmentation.Hypercolumns[13]使用类似的方法进行目标实例分割。
79 HyperNet
(1)
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features.在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
80 ParseNet
(1)
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features.在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
81 ION
(1)
[ˈaɪən]
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features.在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
82 concatenate
(1)
[kɒn'kætɪneɪt]
  • Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features.在计算预测之前,其他几种方法(HyperNet[18],ParseNet[23]和ION[2])将多个层的特征连接起来,这相当于累加转换后的特征。
83 MS-CNN
(1)
  • SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.SSD[22]和MS-CNN[3]可预测特征层级中多个层的目标,而不需要组合特征或分数。
84 U-Net
(1)
[!≈ ju: net]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
85 Recombinator
(1)
[riːkəm'bɪnətə]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
86 Hourglass
(1)
[ˈaʊəglɑ:s]
  • There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation.最近有一些方法利用横向/跳跃连接将跨分辨率和语义层次的低级特征映射关联起来,包括用于分割的U-Net[31]和SharpMask[28],Recombinator网络[17]用于人脸检测以及Stacked Hourglass网络[26]用于关键点估计。
87 Ghiasi
(1)
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation.Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
88 Laplacian
(1)
[lɑ:'plɑ:siәn]
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation.Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
89 progressively
(1)
[prəˈgresɪvli]
  • Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation.Ghiasi等人[8]为FCN提出拉普拉斯金字塔表示,以逐步细化分割。
90 general-purpose
(1)
['dʒenrəl 'pɜ:pəs]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11].由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
91 proposer
(1)
[prəˈpəʊzə(r)]
  • The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11].由此产生的特征金字塔网络是通用的,在本文中,我们侧重于滑动窗口提议(Region Proposal Network,简称RPN)[29]和基于区域的检测器(Fast R-CNN)[11]。
92 Sec.6.
(1)
  • We also generalize FPNs to instance segmentation proposals in Sec.6.在第6节中,我们还将FPN推广到实例分割提议。
93 arbitrary
(1)
[ˈɑ:bɪtrəri]
  • Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion.我们的方法以任意大小的单尺度图像作为输入,并以全卷积的方式在多个层级上输出大小成比例的特征映射。
94 proportionally
(1)
[prə'pɔ:ʃənlɪ]
  • Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion.我们的方法以任意大小的单尺度图像作为输入,并以全卷积的方式在多个层级上输出大小成比例的特征映射。
95 feed-forward
(1)
['fi:df'ɔ:wəd]
  • The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2.自下而上的路径是骨干ConvNet的前馈计算,它计算出由多个尺度的特征映射组成的特征层级,相邻尺度间的步长为2。
96 hallucinate
(1)
[həˈlu:sɪneɪt]
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
97 spatially
(1)
['speɪʃəlɪ]
  • The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。
98 lower-level
(1)
[!≈ ˈləʊə(r) ˈlevl]
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
99 localized
(1)
[ˈləʊkəlaɪzd]
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
100 subsampled
(1)
  • The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.自下而上的特征映射具有较低级别的语义,但其激活可以更精确地定位,因为它被下采样的次数更少。
101 coarser-resolution
(1)
[!≈ kɔ:sə ˌrezəˈlu:ʃn]
  • With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity).使用较粗糙分辨率的特征映射,我们将空间分辨率上采样为2倍(为了简单起见,使用最近邻上采样)。
102 iterated
(1)
[ˈɪtəˌreɪtid]
  • This process is iterated until the finest resolution map is generated.迭代这个过程,直到生成最精细分辨率的映射。
103 append
(1)
[əˈpend]
  • Finally, we append a 3 × 3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling.最后,我们在每个合并的映射上添加一个3×3卷积来生成最终的特征映射,这是为了减少上采样的混叠效应。
104 alias
(1)
[ˈeɪliəs]
  • Finally, we append a 3 × 3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling.最后,我们在每个合并的映射上添加一个3×3卷积来生成最终的特征映射,这是为了减少上采样的混叠效应。
105 non-linearity
(1)
['nɒnlaɪn'ərɪtɪ]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
106 empirically
(1)
[ɪm'pɪrɪklɪ]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
107 minor
(1)
[ˈmaɪnə(r)]
  • There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.在这些额外的层中没有非线性,我们在实验中发现这些影响很小。
108 minimal
(1)
[ˈmɪnɪməl]
  • To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.为了证明我们方法的简洁性和有效性,我们对[29,11]的原始系统进行最小修改,使其适应我们的特征金字塔。
109 class-agnostic
(1)
[!≈ klɑ:s ægˈnɒstɪk]
  • RPN [29] is a sliding-window class-agnostic object detector.RPN[29]是一个滑动窗口类不可知的目标检测器。
110 Intersection-over-Union
(1)
[!≈ ˌɪntəˈsekʃn ˈəʊvə(r) ˈju:niən]
  • We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29].如[29],我们根据锚点和实际边界框的交并比(IoU)比例将训练标签分配给锚点。
111 finer-resolution
(1)
[!≈ 'faɪnə ˌrezəˈlu:ʃn]
  • Intuitively, Eqn. (1) means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k=3$).直觉上,方程(1)意味着如果RoI的尺度变小了(比如224的1/2),它应该被映射到一个分辨率更精细的级别(比如$k=3$)。
112 predictor
(1)
[prɪˈdɪktə(r)]
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels.我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
113 class-specific
(1)
[!≈ klɑ:s spəˈsɪfɪk]
  • We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels.我们在所有级别的所有RoI中附加预测器头部(在Fast R-CNN中,预测器头部是特定类别的分类器和边界框回归器)。
114 harness
(1)
[ˈhɑ:nɪs]
  • In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid.在[16]中,ResNet的conv5层(9层深的子网络)被用作conv4特征之上的头部,但我们的方法已经利用了conv5来构建特征金字塔。
115 ImageNet1k
(1)
  • As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset.正如通常的做法[12],所有的网络骨干都是在ImageNet1k分类集[33]上预先训练好的,然后在检测数据集上进行微调。
116 reimplementation
(1)
  • Our code is a reimplementation of py-faster-rcnn using Caffe2.我们的代码是用Caffe2对py-faster-rcnn的重新实现。
117 py-faster-rcnn
(1)
  • Our code is a reimplementation of py-faster-rcnn using Caffe2.我们的代码是用Caffe2对py-faster-rcnn的重新实现。
118 Caffe
(1)
  • Our code is a reimplementation of py-faster-rcnn using Caffe2.我们的代码是用Caffe2对py-faster-rcnn的重新实现。
119 enrichment
(1)
[ɪn'rɪtʃmənt]
  • How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway.自上而下的改进有多重要?表1(d)显示了没有自上而下路径的特征金字塔的结果。
120 par
(1)
[pɑ:(r)]
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours.表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
121 lag
(1)
[læg]
  • The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours.表1(d)中的结果与RPN基线相当,并且远远落后于我们的结果。
122 conjecture
(1)
[kənˈdʒektʃə(r)]
  • We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets.我们推测这是因为自下而上的金字塔(图1(b))的不同层次之间存在较大的语义差距,尤其是对于非常深的ResNets。
123 level-specific
(1)
[!≈ ˈlevl spəˈsɪfɪk]
  • This issue cannot be simply remedied by level-specific heads.这个问题不能简单地通过特定层级的头部来解决。
124 downsampled
(1)
  • But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times.但是我们认为这些特征的位置并不精确,因为这些映射已经进行了多次下采样和上采样。
125 highest-resolution
(1)
[!≈ haɪɪst ˌrezəˈlu:ʃn]
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids).金字塔表示有多重要?可以不采用金字塔表示,而将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中最精细的层级)。
126 i.e.
(1)
[ˌaɪ ˈi:]
  • How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids).金字塔表示有多重要?可以不采用金字塔表示,而将头部附加到$P_2$的最高分辨率、强语义的特征映射上(即我们金字塔中最精细的层级)。
127 orthogonal
(1)
[ɔ:'θɒgənl]
  • It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a).它得到了28.8的AP,表明2-fc头部没有给我们带来任何超过表2(a)中基线的正交优势。
128 sub-section
(1)
['sʌbs'ekʃn]
  • Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above sub-section for RPN.表2(d)和(e)表明,去除自上而下的连接或去除横向连接会导致较差的结果,类似于我们在上面的RPN小节中观察到的结果。
129 noteworthy
(1)
[ˈnəʊtwɜ:ði]
  • It is noteworthy that removing top-down connections (Table 2(d)) significantly degrades the accuracy, suggesting that Fast R-CNN suffers from using the low-level features at the high-resolution maps.值得注意的是,去除自上而下的连接(表2(d))会显著降低准确性,这表明Fast R-CNN因在高分辨率映射上使用低级特征而受到影响。
130 warping-like
(1)
[!≈ 'wɔ:pɪŋ laɪk]
  • We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales.我们认为这是因为RoI池化是一种扭曲式的操作,对区域尺度较不敏感。
131 reproduction
(1)
[ˌri:prəˈdʌkʃn]
  • Table 3(a) shows our reproduction of the baseline Faster R-CNN system as described in [16].表3(a)显示了我们对[16]中所描述的基线Faster R-CNN系统的复现。
132 convergence
(1)
[kən'vɜ:dʒəns]
  • We find the following implementations contribute to the gap: (i) We use an image scale of 800 pixels instead of 600 in [11, 16]; (ii) We train with 512 RoIs per image which accelerate convergence, in contrast to 64 RoIs in [11, 16]; (iii) We use 5 scale anchors instead of 4 in [16] (adding $32^2$); (iv) At test time we use 1000 proposals per image instead of 300 in [16].我们发现以下实现有助于缩小差距:(i)我们使用800像素的图像尺度,而不是[11,16]中的600像素;(ii)与[11,16]中的64个ROI相比,我们训练时每张图像有512个ROIs,可以加速收敛;(iii)我们使用5个尺度的锚点,而不是[16]中的4个(添加$32^2$);(iv)在测试时,我们每张图像使用1000个提议,而不是[16]中的300个。
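The four implementation choices packed into the sentence above are easier to scan as a configuration block; a small sketch restating them (the key names are illustrative, not from the authors' code):

```python
faster_rcnn_fpn_cfg = {
    "image_short_side": 800,                   # instead of 600
    "rois_per_image": 512,                     # instead of 64; faster convergence
    "anchor_scales": (32, 64, 128, 256, 512),  # 5 scales, adding 32^2
    "test_proposals_per_image": 1000,          # instead of 300
}
```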
133 FPN-based
(1)
  • With feature sharing, our FPN-based Faster R-CNN system has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.172 seconds for ResNet-101.通过特征共享,我们的基于FPN的Faster R-CNN系统使用ResNet-50在单个NVIDIA M40 GPU上每张图像的推断时间为0.148秒,使用ResNet-101的时间为0.172秒。
134 leaderboard
(1)
['li:dərbɔ:d]
  • This model is the one we submitted to the COCO detection leaderboard, shown in Table 4.该模型是我们提交给COCO检测排行榜的模型,如表4所示。
135 feature-sharing
(1)
[!≈ ˈfi:tʃə(r) 'ʃeərɪŋ]
  • We have not evaluated its feature-sharing version due to limited time, which should be slightly better as implied by Table 5.由于时间有限,我们尚未评估其特征共享版本,这应该稍微好一些,如表5所示。
136 Multipath
(1)
['mʌltɪpæθ]
  • Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival).一些在test-std数据集上的结果是不可获得的,因此我们也包括了在test-dev上的结果(和Multipath[40]在minival上的结果)。
137 AttractioNet
(1)
  • ^§: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result.^§:AttractioNet[10]的这一参赛结果采用VGG-16生成提议,用Wide ResNet[39]进行目标检测,因此它不是严格意义上的单模型结果。
138 G-RMI
(1)
  • Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors.表4将我们方法的单模型结果与COCO竞赛获胜者的结果进行了比较,其中包括2016年冠军G-RMI和2015年冠军Faster R-CNN+++。没有添加额外的东西,我们的单模型提交就已经超越了这些强大的、经过严格设计的竞争对手。
139 small-scale
(1)
[ˈsmɔ:lˈskeɪl]
  • It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects.值得注意的是,我们的方法不依赖图像金字塔,只使用单个输入图像尺度,但在小型目标上仍然具有出色的AP。
140 iterative
(1)
['ɪtərətɪv]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
141 mining
(1)
[ˈmaɪnɪŋ]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
142 augmentation
(1)
[ˌɔ:ɡmen'teɪʃn]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
143 complementary
(1)
[ˌkɒmplɪˈmentri]
  • Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.此外,我们的方法没有利用许多流行的改进,如迭代回归[9],难例挖掘[35],上下文建模[16],更强大的数据增强[22]等。这些改进与FPN互补,应该会进一步提高准确度。
144 convolutionally
(1)
[!≈ kɒnvə'lu:ʃənəli]
  • At inference time, these models are run convolutionally to generate dense proposals in an image.在推断时,这些模型是卷积运行的,以在图像中生成密集的提议。
145 setup
(1)
['setʌp]
  • We use a fully convolutional setup for both training and inference.我们对训练和推断都使用全卷积设置。
146 Additionally
(1)
[ə'dɪʃənəlɪ]
  • Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves.此外,由于在[27,28]的图像金字塔中每组使用2个尺度,我们使用输入大小为7×7的第二个MLP来处理半个组。
147 Instance-FCN
(1)
  • DeepMask, SharpMask, and FPN use ResNet-50 while Instance-FCN uses VGG-16.DeepMask,SharpMask和FPN使用ResNet-50,而Instance-FCN使用VGG-16。
148 zoom
(1)
[zu:m]
  • DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants).DeepMask和SharpMask性能计算的模型是从https://github.com/facebookresearch/deepmask上获得的(都是‘zoom’变体)。
149 runtime
(1)
[rʌn'taɪm]
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
150 K40
(1)
  • ^† Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.^†运行时间是在NVIDIA M40 GPU上测量的,除了基于较慢的K40的InstanceFCN。
151 Sharp-Mask
(1)
[!≈ ʃɑ:p mɑ:sk]
  • We also report comparisons to DeepMask [27], Sharp-Mask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation.我们还报告了与DeepMask[27],Sharp-Mask[28]和InstanceFCN[4]的比较,这是以前的掩模提议生成中的先进方法。
152 computationally
(1)
[!≈ ˌkɒmpjuˈteɪʃənli]
  • Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by 2^{\lbrace −2:0.5:1 \rbrace} in [27, 28]), making them computationally expensive.现有的掩码提议方法[27,28,4]基于密集采样的图像金字塔(例如,在[27,28]中按2^{\lbrace −2:0.5:1 \rbrace}缩放),使得它们的计算代价很高。
153 substantially
(1)
[səbˈstænʃəli]
  • Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS).我们的方法基于FPN,速度明显加快(我们的模型运行速度为6至7FPS)。